In this project, our main aim is to forecast how many units of each product Trendyol sells per day. Trendyol provided us a dataset containing sales figures and related variables. A reliable forecast could help Trendyol make more accurate inventory decisions and let the marketing department direct advertising where it will generate the most sales. Looking at the historical data, we develop a forecasting method to predict the next day's sales quantity. We made submissions every day in a live competition, which enhanced our competitive spirit as a group. Rankings are determined by the lowest weighted mean absolute percentage error (WMAPE) of the predictions. First, we reshape the raw data into a data table with a separate block of rows for each product. Some values are missing; to remove the null values for those products, we move the starting point of the series forward, which improves the model's reliability. Seasonality is analysed with the necessary decompositions. A 7-day frequency gives the best ACF and PACF behaviour in our models, so we use freq = 7 in all predictions. In the second step, we propose several ARIMA models to build on. Looking at the ACF graphs, some AR and MA models stand out from the others; however, for most products the auto.arima function finds the best ARIMA model. The best model is justified by AIC and BIC values lower than those of the alternatives. These are not the only test measures: we wrote a function called "accu" that gathers the accuracy metrics together, which made comparing our models faster. The competition emphasises WMAPE, but in the study we also report MAPE, MAD and MADP. Next, we wrote a function called "correlation" that computes the correlation between sold count and the other columns of the data table; the aim here is to find a regressor that can strengthen the model. In the ARIMAX models we try the three most correlated variables.
After adding the highly correlated regressor, the AIC values are inspected again and the lowest one is chosen. According to our test function "accu", the MAD, MADP, MAPE and WMAPE values all improve after incorporating the regressor. During the test period we built different models to see which worked best; after selecting the best model, the test days were run. We submitted predictions each day from our model, starting on June 11th and ending on June 25th, when the submission system closed and our predictions finished.
The aim of this study is to develop forecasting approaches and to manipulate the data provided by Trendyol so that the necessary interpretations can be made. During the study, seasonality for each product will be examined. Then the necessary decomposition will be done so that ARIMA models can be used. Further, possible regressors will be checked and ARIMAX and SARIMAX models will be utilized. The best model will be chosen according to performance and test results. Necessary interpretations and visualizations will be made at each step. Lastly, forecasting will be done using the most appropriate model selected during these steps.
Product names used in the work:
1)Yüz Temizleyici
2)Bebek Islak Mendili
3)Xiaomi Bluetooth Kulaklık
4)Fakir Süpürge
5)Tayt
6)Diş Fırçası
7)Mont
8)Bikini Model-1
9)Bikini Model-2
In this step, the required libraries are loaded, the necessary data tables are created, and the other initial operations are executed.
library(data.table)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(mgcv)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
library(fpp)
## Loading required package: forecast
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
## The following object is masked from 'package:nlme':
##
## getResponse
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: tseries
require(gratia)
## Loading required package: gratia
library(readxl)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(forecast)
library(zoo)
library(stringr)
library(stats)
library(urca)
g13<-read_excel("g13.xlsx")
dat<-as.data.table(g13) # Main Data Table
yuzt<-dat[1:372,] # Data table which holds the data of Product-1 (Yüz Temizleyici)
mendil<-dat[373:744,] # Data table which holds the data of Product-2 (Bebek Islak Mendili)
xiaomi<-dat[745:1116,] # Data table which holds the data of Product-3 (Xiaomi Bluetooth Kulaklık)
fakir<-dat[1117:1488,] # Data table which holds the data of Product-4 (Fakir Süpürge)
tayt<-dat[1489:1860,] # Data table which holds the data of Product-5 (Tayt)
bik1<-dat[1861:2232,] # Data table which holds the data of Product-6 (Bikini Model-1)
firca<-dat[2233:2604,] # Data table which holds the data of Product-7 (Diş Fırçası)
mont<-dat[2605:2976,] # Data table which holds the data of Product-8 (Mont)
bik2<-dat[2977:3348,] # Data table which holds the data of Product-9 (Bikini Model-2)
In this step, seasonality of each product is analysed to make appropriate decomposition.
The plot below shows the sold count of Product-1 for the related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. That observation directs us to check the autocorrelation so that the related seasonality can be determined.
The table below shows the autocorrelation values at each lag up to lag 100.
The autocorrelation function shows a pattern as well: around lags 15 and 60 the autocorrelation values rise, and the autocorrelation fades away at later lags. Ordinarily, the time series object for Product-1's sold count would be given a frequency of 15, which would also cover the autocorrelation at lag 60. However, as will be observed in the next step, choosing a frequency of 7 gives better decomposition results than 15. At that frequency we also cover some other autocorrelation lags better and obtain more stationary data, so it is fair to say there is weekly seasonality even if the autocorrelation function does not show it directly.
As seen below, the variance increases over time, so the best way to decompose the series is the multiplicative method.
Required codes are below:
ts11=ts(yuzt$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is quite low and acceptable. Moreover, as seen in the graph, the random (detrended) values are fairly stationary, with stable variance and a nearly constant mean. The trend seems to increase with fluctuations, which may cause some forecast error, but the model is still acceptable. Lastly, the seasonal component shows a decent, regular pattern.
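All four components discussed here (observed, trend, seasonal and random) can be viewed at once with base R's plot method for decomposed series; a minimal sketch below uses the built-in AirPassengers data as a stand-in for the project series ts11:

```r
# Sketch: multiplicative decomposition of a built-in monthly series
# (AirPassengers stands in for ts11; the project data are not reproduced here)
dec <- decompose(AirPassengers, type = "multiplicative")
plot(dec)           # observed, trend, seasonal and random in one figure
summary(dec$random) # the random component should hover around 1 if the fit is good
```

For a multiplicative decomposition, a random component fluctuating tightly around 1 plays the same role as the "stable variance, nearly constant mean" reading above.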
Required codes are below:
ts12=ts(yuzt$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
The reason the ACF is so high at lag 15 might be that in Turkey most people are paid on either the 1st or the 15th of the month, which would justify the high ACF value around lag 15. (Source) People probably tend to shop more after receiving their salary; even though people generally pay by credit card, having fresh money gives them the confidence to spend more. So another seasonality is observed roughly every 15 days. However, as mentioned above, the ACF of the detrended series is worse in this case. Moreover, the graph below shows the plot of the detrended values, which is worse than the one at frequency = 7.
The detrended (random) values are less stationary in this case: they do not have constant variance, and their mean is less constant than before. The trend in this series is wavier but increases more. Lastly, the seasonal component looks similar. Overall, this model is worse than the previous one because the detrended random values show more nonstationarity.
Required codes are below:
ts13=ts(yuzt$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
The detrended values are worst in this model: they show nonstationary behavior due to changing variance and mean. The trend component increases more than in the other two models and fluctuates less. The ACF of the detrended values is high, which is not what we want, and the random values are also the worst of the three models. The seasonal component has a larger range, which affects the results more.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As understood from the ACF and PACF plots, both tend to decrease. Reading the ACF as decaying, the PACF shows no significant values after lag 3, so an AR(3) model can be proposed. Reading the PACF as decaying, the ACF shows no significant values after lag 3, so MA(3) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of a seasonal ARIMA here.
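The ACF/PACF reading described above can be reproduced in one call with forecast's tsdisplay(); the sketch below uses a simulated AR(3) series with illustrative coefficients, since the real input would be decomposed11$random:

```r
library(forecast)
set.seed(13)
# Simulated AR(3) series standing in for the detrended random component;
# the coefficients are illustrative, not the estimated project values
x <- arima.sim(model = list(ar = c(0.2, -0.1, -0.3)), n = 372)
tsdisplay(x, main = "Simulated AR(3): series, ACF and PACF")
# For an AR(3), the PACF should cut off after lag 3 while the ACF decays
```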
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.1893 -0.1191 -0.3100 0.9748
## s.e. 0.0496 0.0503 0.0497 0.0127
##
## sigma^2 estimated as 0.09049: log likelihood = -79.89, aic = 169.78
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the nature of AR models.
Residuals show a constant mean, but the variance differs at times. The ACF and PACF plots still show some signs of autocorrelation.
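The visual reading of the residual ACF can be backed by a formal Ljung-Box test; a sketch on simulated data follows, since model11 depends on the project data:

```r
# Sketch: Ljung-Box test for leftover autocorrelation in AR(3) residuals.
# Simulated data stand in for decomposed11$random; coefficients are illustrative.
set.seed(42)
y <- arima.sim(model = list(ar = c(0.19, -0.12, -0.31)), n = 372)
fit <- arima(y, order = c(3, 0, 0))
Box.test(residuals(fit), lag = 10, type = "Ljung-Box")
# A small p-value would confirm that autocorrelation remains in the residuals
```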
MA(3) is created below:
model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## 0.0885 -0.2125 -0.3677 0.9748
## s.e. 0.0472 0.0503 0.0440 0.0080
##
## sigma^2 estimated as 0.08918: log likelihood = -77.27, aic = 164.54
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too.
Residuals show a constant mean, but the variance seems to vary more in this model. Even though the ACF and PACF values are better here, they are still high.
Auto Arima is created below:
model13=auto.arima(decomposed11$random,seasonal=FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(3,0,2) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 mean
## 0.2382 0.2147 -0.3872 -0.1166 -0.4240 0.9748
## s.e. 0.2073 0.1875 0.0779 0.2199 0.1803 0.0076
##
## sigma^2 estimated as 0.08708: log likelihood=-69.97
## AIC=153.93 AICc=154.25 BIC=181.25
model13=arima(decomposed11$random,order=c(3,0,2)) # Manual version of auto.arima (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(3,0,2)")
Residuals show a constant mean, but the variance differs at times. The ACF and PACF plots tell us this is the best model, with the lowest ACF and PACF values.
AIC(model11)
## [1] 169.7801
BIC(model11)
## [1] 189.2933
AIC(model12)
## [1] 164.5391
BIC(model12)
## [1] 184.0523
AIC(model13)
## [1] 153.9338
BIC(model13)
## [1] 181.2523
As seen above, model13, the auto.arima result ARIMA(3,0,2), is the best model with the lowest values. ARIMA(3,0,2) should be selected for the further steps.
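The pairwise AIC()/BIC() calls above can be gathered into a single comparison table, which scales better if more candidate orders are tried; the sketch below uses a simulated ARMA(3,2) series in place of decomposed11$random:

```r
# Sketch: one-table AIC/BIC comparison of candidate orders.
# A simulated ARMA(3,2) series stands in for the project's random component.
set.seed(7)
x <- arima.sim(model = list(ar = c(0.2, 0.2, -0.35), ma = c(-0.1, -0.4)), n = 372)
orders <- list(c(3, 0, 0), c(0, 0, 3), c(3, 0, 2))  # AR(3), MA(3), ARMA(3,2)
fits <- lapply(orders, function(o) arima(x, order = o))
data.frame(order = sapply(orders, paste, collapse = ","),
           AIC   = sapply(fits, AIC),
           BIC   = sapply(fits, BIC))
# The row with the lowest AIC/BIC is the one carried to the next step
```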
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.2702096 17.51891 0.2361159 0.2361159
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.2703676 17.21833 0.2320647 0.2320647
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.2619928 16.86907 0.2273575 0.2273575
Again the third model, ARIMA(3,0,2), gives the best result, with the lowest errors on all test measures.
In this step, possible regressors will be examined for use in an ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(data= NA, nrow=8)
colnames(a)=c("Variable","Correlation")
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
cor(yuzt$sold_count,yuzt$price)
## [1] -0.2299915
correlation(yuzt)
## Variable Correlation
## 1 visit count 0.3144539
## 2 basket count 0.8194234
## 3 favored count 0.4509207
## 4 category sold 0.6230038
## 5 category visit 0.1228725
## 6 category basket 0.2880863
## 7 category favored 0.6858087
## 8 category brand sold 0.3925289
Clearly there is high correlation between sold count and basket count, category sold, and category favored. This makes sense because people tend to buy what they have favored or put in their basket. Thanks to the function created above, possible regressors are selected easily. The potential problem with these regressors is that their future values are unknown, so they must be predicted as well in the final version of the model.
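One possible remedy, besides the moving-average scheme used later in this report, is to forecast the regressor with its own model; the sketch below uses a simulated stand-in series (the name basket is illustrative, not a project variable):

```r
library(forecast)
# Sketch: produce future regressor values by modelling the regressor itself.
# `basket` is a simulated stand-in for a daily basket-count series.
set.seed(1)
basket <- ts(100 + arima.sim(model = list(ar = 0.6), n = 200, sd = 5),
             frequency = 7)
basket_future <- forecast(auto.arima(basket), h = 14)$mean  # 14 days ahead
length(basket_future)
# These values would then be supplied to predict(..., newxreg = basket_future)
```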
In this step, the potential regressors will be used in the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(3,0,2),xreg=yuzt$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 2), xreg = yuzt$basket_count)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept yuzt$basket_count
## 0.1656 0.1448 -0.3539 2e-04 -0.2923 0.8806 3e-04
## s.e. 0.1401 0.1554 0.0039 NaN NaN NaN NaN
##
## sigma^2 estimated as 0.07997: log likelihood = -57.27, aic = 130.53
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(3,0,2),xreg=yuzt$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 2), xreg = yuzt$category_sold)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept yuzt$category_sold
## -0.0960 -0.0457 -0.2361 0.3688 0.0609 0.8011 4e-04
## s.e. 0.1352 0.1309 NaN NaN NaN NaN NaN
##
## sigma^2 estimated as 0.07733: log likelihood = -51.05, aic = 118.1
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(3,0,2),xreg=yuzt$category_favored)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 2), xreg = yuzt$category_favored)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept
## 0.1691 0.1876 -0.3685 -0.0422 -0.3731 0.8898
## s.e. 0.1974 0.1762 0.0707 0.2061 0.1654 0.0137
## yuzt$category_favored
## 0
## s.e. NaN
##
## sigma^2 estimated as 0.08231: log likelihood = -62.61, aic = 141.22
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
Three models are constructed, each with one of the three regressors. Now they will be tested on the last week.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004031973 0.4117975 0.005550114 0.005550114
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004048837 0.3541774 0.004773523 0.004773523
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004150725 0.3935481 0.005304152 0.005304152
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004010801 0.4028655 0.005429729 0.005429729
As seen in the results, all models perform similarly, but arimax1 is slightly better in terms of WMAPE, MAD and MADP. Moreover, arimax1 (basket count as regressor) performs better than the plain ARIMA model from the previous step.
After all the tests and operations above, the arimax1 model gives the best results and is selected as the final model.
Predicting future basket count values with the moving-average method:
basket1<-tail(yuzt$basket_count,7)
for (i in 1:26) {
basket1<-append(basket1,(basket1[i]+basket1[i+1]+basket1[i+2]+basket1[i+3]+basket1[i+4]+basket1[i+5]+basket1[i+6])/7)
}
basket_rounded<-round(basket1)
basket_rounded
## [1] 534 982 515 397 570 633 744 625 638 589 599 628 637 637 622 621 619 623 627
## [20] 627 625 624 624 624 625 625 625 624 624 624 625 625 625
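The recursive step above can be packaged as a small helper so the window length is explicit; ma_extend is a hypothetical name, not part of the project code:

```r
# Sketch: extend a series by h steps, each new value being the mean
# of the previous k values (k = 7 matches the weekly window used above).
ma_extend <- function(x, h, k = 7) {
  for (i in seq_len(h)) x <- c(x, mean(tail(x, k)))
  x
}
ma_extend(c(534, 982, 515, 397, 570, 633, 744), h = 3)  # same recursion as the loop
```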
Forecasted values are below. Note that the predictions printed here are still on the scale of the random component; multiplying them by the trend and seasonal components computed below returns them to the sold-count scale.
basket_count_forxreg<-c(tail(yuzt$basket_count,3),basket_rounded[8:33])
preds<-predict(modelx1,n.ahead = 29,newxreg = basket_count_forxreg)$pred
preds<-ts(preds,frequency = 7)
last_trend_value <-tail(decomposed11$trend[!is.na(decomposed11$trend)],1)
seasonality=decomposed11$seasonal[6:34]
print(preds)
## Time Series:
## Start = c(1, 1)
## End = c(5, 1)
## Frequency = 7
## [1] 1.083829 1.023652 1.039651 1.019449 1.050139 1.048070 1.052394 1.052396
## [9] 1.049849 1.047322 1.044917 1.046304 1.047171 1.048143 1.048815 1.048238
## [17] 1.047585 1.047355 1.047548 1.047628 1.047923 1.047871 1.047838 1.047548
## [25] 1.047558 1.047567 1.047846 1.047846 1.047844
ts.plot(preds,xlab = "Weeks", ylab="Predictions", main="Forecasts")
In this step, seasonality of each product is analysed to make appropriate decomposition.
The plot below shows the sold count of Product-2 for the related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. That observation directs us to check the autocorrelation so that the related seasonality can be determined.
The table below shows the autocorrelation values at each lag up to lag 100.
The autocorrelation function shows a pattern as well: around lags 15 and 60 the autocorrelation values rise, and the autocorrelation fades away at later lags. Ordinarily, the time series object for Product-2's sold count would be given a frequency of 15, which would also cover the autocorrelation at lag 60. However, just as with Product-1, choosing a frequency of 7 gives better decomposition results than 15. At that frequency we also cover some other autocorrelation lags better and obtain more stationary data, so it is fair to say there is weekly seasonality even if the autocorrelation function does not show it directly.
As seen below, the variance increases over time, so the best way to decompose the series is the multiplicative method.
Required codes are below:
ts11=ts(mendil$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is quite low and acceptable. Moreover, as seen in the graph, the random (detrended) values are fairly stationary, with stable variance and a nearly constant mean. The trend seems to increase with fluctuations, which may cause some forecast error, but the model is still acceptable. Lastly, the seasonal component shows a decent, regular pattern.
Required codes are below:
ts12=ts(mendil$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
The detrended (random) values are less stationary in this case: they do not have constant variance, and their mean is less constant than before. The trend in this series is wavier but increases more. Lastly, the seasonal component looks similar. Overall, this model is worse than the previous one because the detrended random values show more nonstationarity.
Required codes are below:
ts13=ts(mendil$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
The detrended values are worst in this model: they show nonstationary behavior due to changing variance and mean. The trend component increases more than in the other two models and fluctuates less. The ACF of the detrended values is high, which is not what we want, and the random values are also the worst of the three models. The seasonal component has a larger range, which affects the results more.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As understood from the ACF and PACF plots, both tend to decrease. Reading the ACF as decaying, the PACF shows no significant values after lag 3, so an AR(3) model can be proposed. Reading the PACF as decaying, the ACF shows no significant values after lag 3, so MA(3) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of a seasonal ARIMA here.
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.3581 -0.2436 -0.2027 0.9602
## s.e. 0.0511 0.0528 0.0510 0.0150
##
## sigma^2 estimated as 0.09778: log likelihood = -94.08, aic = 198.16
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show a constant mean, but the variance differs at times. The ACF and PACF plots still show some signs of autocorrelation.
MA(3) is created below:
model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## 0.2405 -0.2537 -0.3837 0.9598
## s.e. 0.0483 0.0581 0.0469 0.0099
##
## sigma^2 estimated as 0.09676: log likelihood = -92.23, aic = 194.46
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too.
Residuals show a nearly constant mean, but the variance seems to jump in this model. Even though the ACF and PACF values are better here, they are still high.
Auto Arima is created below:
model13=auto.arima(decomposed11$random,seasonal=FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(2,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 mean
## 1.0134 -0.5088 -0.7209 0.9597
## s.e. 0.0616 0.0450 0.0593 0.0091
##
## sigma^2 estimated as 0.09503: log likelihood=-86.95
## AIC=183.9 AICc=184.07 BIC=203.41
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual version of auto.arima (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals show a constant mean, but the variance differs at times. The ACF and PACF plots tell us this is the best model, with the lowest ACF and PACF values.
AIC(model11)
## [1] 198.1614
BIC(model11)
## [1] 217.6746
AIC(model12)
## [1] 194.4632
BIC(model12)
## [1] 213.9764
AIC(model13)
## [1] 183.899
BIC(model13)
## [1] 203.4122
As seen above, model13, the auto.arima result ARIMA(2,0,1), is the best model with the lowest values. ARIMA(2,0,1) should be selected for the further steps.
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.2991687 104.9772 0.2725654 0.2725654
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.3054549 102.8093 0.2669366 0.2669366
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.298858 102.0573 0.2649839 0.2649839
Again the third model, ARIMA(2,0,1), gives the best result in MAPE, MAD and MADP, and is marginally better than the MA model in WMAPE; both beat the AR(3) model. In terms of performance, the ARIMA model is better than the MA model, and the MA model is better than the AR model.
In this step, possible regressors will be examined for use in an ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(data= NA, nrow=8)
colnames(a)=c("Variable","Correlation")
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
cor(mendil$sold_count,mendil$price)
## [1] -0.5721201
correlation(mendil)
## Variable Correlation
## 1 visit count 0.3002766
## 2 basket count 0.8872024
## 3 favored count 0.2832415
## 4 category sold 0.9157251
## 5 category visit 0.4285587
## 6 category basket 0.1962214
## 7 category favored 0.7963313
## 8 category brand sold 0.1320124
Clearly there is high correlation between sold count and basket count, category sold, and category favored. This makes sense because people tend to buy what they have favored or put in their basket. Thanks to the function created above, possible regressors are selected easily. The potential problem with these regressors is that their future values are unknown, so they must be predicted as well in the final version of the model.
In this step, the potential regressors will be used in the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=mendil$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = mendil$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept mendil$basket_count
## 0.5269 0.0123 0.1696 0.6284 3e-04
## s.e. 0.3535 0.2216 0.3458 0.0303 NaN
##
## sigma^2 estimated as 0.06762: log likelihood = -26.61, aic = 65.22
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=mendil$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = mendil$category_sold)
##
## Coefficients:
## ar1 ar2 ma1 intercept mendil$category_sold
## 0.4713 -0.2106 0.0706 0.7133 1e-04
## s.e. 0.1489 0.0604 0.1319 0.0302 NaN
##
## sigma^2 estimated as 0.07551: log likelihood = -46.71, aic = 105.43
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=mendil$category_favored)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = mendil$category_favored)
##
## Coefficients:
## ar1 ar2 ma1 intercept mendil$category_favored
## 0.6261 -0.3234 -0.1738 0.7596 0
## s.e. 0.1341 0.0466 0.1242 0.0306 NaN
##
## sigma^2 estimated as 0.08683: log likelihood = -72.27, aic = 156.53
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 183.899
AIC(modelx1)
## [1] 65.22218
AIC(modelx2)
## [1] 105.4285
AIC(modelx3)
## [1] 156.5311
For an initial comparison, the ARIMAX model with basket count as regressor has the best AIC value.
Three models are constructed, each with one of the three regressors. Now they will be tested on the last week.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.001917651 1.233079 0.003201597 0.003201597
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.002547509 1.521034 0.003949249 0.003949249
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.001909252 1.182135 0.003069323 0.003069323
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 385.1452 422.3167 1.096513 0.002345371 1.412103 0.003666417 0.003666417
The results are all similar, but here arimax2 (category sold) actually has the lowest WMAPE in the last-week test, while arimax1 falls slightly behind the plain ARIMA model.
Nevertheless, given its clearly lower AIC in the comparison above, the arimax1 model (basket count as regressor) is selected as the final model.
Predicting new basket count values with the moving-averages method:
basket1<-tail(mendil$basket_count,7)
for (i in 1:26) {
  basket1 <- append(basket1, mean(basket1[i:(i+6)]))
}
basket_rounded<-round(basket1)
basket_rounded
## [1] 1740 1771 1646 1507 1782 1716 5096 2180 2243 2310 2405 2533 2640 2772 2440
## [16] 2478 2511 2540 2559 2563 2552 2520 2532 2540 2544 2544 2542 2539 2537 2540
## [31] 2541 2541 2541
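The loop above extends the series recursively: each new value is the mean of a 7-value window that slides forward over both the observed tail and the values already forecast. A minimal Python sketch of the same recursion (an illustrative port, not the project code):

```python
def extend_moving_average(history, horizon, window=7):
    """Extend a series by appending, at each step, the mean of a sliding
    window that starts at the oldest retained value and moves forward."""
    series = list(history[-window:])
    for i in range(horizon):
        series.append(sum(series[i:i + window]) / window)
    return series

# Last 7 observed basket counts from the output above:
extended = extend_moving_average([1740, 1771, 1646, 1507, 1782, 1716, 5096], 26)
```

Because each forecast re-enters the window, the one-off spike (5096) is averaged away and the extension settles near a constant level, as seen in the rounded values above.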
Forecasted values are below:
basket_count_forxreg<-c(tail(mendil$basket_count,3),basket_rounded[8:33])
preds<-predict(modelx1,n.ahead = 29,newxreg = basket_count_forxreg)$pred
preds<-ts(preds,frequency = 7)
last_trend_value <-tail(decomposed11$trend[!is.na(decomposed11$trend)],1)
seasonality=decomposed11$seasonal[6:34]
preds=preds*last_trend_value*seasonality
print(preds)
## Time Series:
## Start = c(1, 1)
## End = c(5, 1)
## Frequency = 7
## [1] 609.7210 674.8867 1763.6751 1170.1582 1148.4721 1059.0367 834.9282
## [8] 748.5543 858.2950 1175.4905 1248.9548 1216.2317 1111.1613 861.8301
## [15] 753.1620 843.6258 1119.4470 1272.3600 1231.4754 1118.5839 862.6226
## [22] 750.5944 839.6008 1116.1323 1277.3250 1233.7306 1118.8396 862.0319
## [29] 750.0804
plot(preds,xlab="Weeks", main="Forecasts", ylab="Predictions")
In this step, the seasonality of each product is analysed to choose an appropriate decomposition.
The plot below shows the sold count of product-3 for the related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. That observation directs us to check the autocorrelation in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around the 7th and 30th lags, the autocorrelation values get higher, and the autocorrelation fades away at later lags. Normally the time series of sold count for product-3 ought to be given a frequency of 7 or 30, but we had better settle the seasonality after the decomposition stage. There is weekly, and possibly monthly, seasonality for this product.
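The lag inspection above can be made concrete: for a series that repeats every 7 observations, the sample autocorrelation peaks at lag 7 (and its multiples) and is weaker elsewhere. A small Python illustration with synthetic data (not the project data):

```python
def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return cov / var

# A noiseless series repeating every 7 observations (20 "weeks"):
weekly = [10, 12, 15, 14, 13, 11, 9] * 20
```

Here `autocorr(weekly, 7)` and `autocorr(weekly, 14)` are close to 1, while off-period lags are smaller in magnitude; this is exactly the pattern used above to read the seasonality off the ACF table.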
As seen below, the variance increases over time, so the best way of decomposing is the multiplicative method.
Required codes are below:
ts11=ts(xiaomi$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is quite low and acceptable. Moreover, as seen in the graph, the random (detrended) values are fairly stationary, with stable variance (except for some jumps) and a nearly constant mean. The trend gradually rises and falls with fluctuations, which may introduce some forecast error, but the model is still acceptable. Lastly, the seasonal component shows a decent pattern with an acceptable shape.
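Dividing out the seasonal and trend components before modelling, and multiplying them back onto fitted or predicted values later, relies on the multiplicative identity y[t] = trend[t] * seasonal[t] * random[t]. A minimal numeric round-trip check in Python (illustrative values, not the project data):

```python
# Multiplicative decomposition: y = trend * seasonal * remainder.
trend     = [100.0, 102.0, 104.0, 106.0]
seasonal  = [1.2, 0.8, 1.1, 0.9]
remainder = [1.01, 0.98, 1.00, 1.03]

# Compose the series, then invert the composition by division:
y = [t * s * r for t, s, r in zip(trend, seasonal, remainder)]
recovered = [yi / (ti * si) for yi, ti, si in zip(y, trend, seasonal)]

# Multiplying the components back recovers the original series:
refit = [ri * ti * si for ri, ti, si in zip(recovered, trend, seasonal)]
```

This is the same round trip performed with `decomposed11$seasonal` and `decomposed11$trend` throughout this report.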
Required codes are below:
ts13=ts(xiaomi$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
The detrended values are worse in this model. They show nonstationary behavior due to changing variance and mean. The trend component gives no clear indication of increase or decrease. The ACF of the detrended values is not bad, but the random component is worse than in the weekly model. The seasonal component has a larger deviation, which may make the results fluctuate more.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As understood from the ACF and PACF plots, both tend to decay. For the decaying-ACF reading, the PACF does not show significant spikes after lag 3, so an AR(3) model can be proposed. For the decaying-PACF reading, the ACF does not show significant spikes after lag 3, so MA(3) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of a seasonal ARIMA here.
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.1838 -0.1684 -0.3338 0.9851
## s.e. 0.0492 0.0493 0.0491 0.0094
##
## sigma^2 estimated as 0.05641: log likelihood = 6.56, aic = -3.11
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show constant mean but variance increases.
MA(3) is created below:
model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## 0.0280 -0.2659 -0.4802 0.9848
## s.e. 0.0423 0.0519 0.0475 0.0034
##
## sigma^2 estimated as 0.05211: log likelihood = 20.81, aic = -31.62
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too.
Residuals are very similar to the first model's, and the AIC value is better (lower) here.
Auto Arima is created below:
model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(5,0,3) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.2611 -0.6062 0.1784 -0.1918 -0.2221 -0.1929 0.3040 -0.5984
## s.e. 0.1403 0.1237 0.1092 0.0840 0.0689 0.1381 0.1086 0.0989
## mean
## 0.9851
## s.e. 0.0038
##
## sigma^2 estimated as 0.0501: log likelihood=32.35
## AIC=-44.69 AICc=-44.07 BIC=-5.67
model13=arima(decomposed11$random,order=c(5,0,3)) # Manual re-fit of the auto.arima choice (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(5,0,3)")
Residuals show a constant mean, but the variance differs at times. The ACF and PACF plots tell us this is the best model in terms of lower remaining autocorrelation.
AIC(model11)
## [1] -3.110113
BIC(model11)
## [1] 16.40305
AIC(model12)
## [1] -31.62326
BIC(model12)
## [1] -12.1101
AIC(model13)
## [1] -44.69223
BIC(model13)
## [1] -5.665892
As seen above, model13, the auto.arima fit ARIMA(5,0,3), is the best on AIC. However, the MA model beats it on BIC, because BIC penalizes extra parameters more heavily.
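The AIC/BIC trade-off can be made concrete with the standard formulas AIC = 2k - 2*logL and BIC = k*ln(n) - 2*logL. The Python sketch below uses the log-likelihoods printed in the summaries above; the parameter counts k (coefficients plus the error variance, the way R's arima() counts them) and the effective sample size n = 366 (series length minus the NA edges from decompose) are our assumptions for reproducing the reported values:

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2*logL."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*logL.
    Since ln(n) > 2 whenever n > e^2 (about 7.4), BIC punishes each
    extra parameter harder than AIC on series of any realistic length."""
    return k * math.log(n) - 2 * log_lik

# Log-likelihoods from the printed model summaries above (k and n assumed):
ma3     = {"log_lik": 20.81, "k": 5}    # MA(3): 4 coefficients + error variance
arima53 = {"log_lik": 32.35, "k": 10}   # ARIMA(5,0,3): 9 coefficients + error variance
n = 366                                 # effective observations after decompose() NA edges
```

With these inputs, ARIMA(5,0,3) wins on AIC while MA(3) wins on BIC, reproducing the flip seen above.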
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
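The R helper above ports directly to Python; the sketch below (illustrative, not the project code) also makes explicit why the MADP and WMAPE columns are identical in every table: for strictly positive actuals, WMAPE = (sum|e|/n) / (sum(a)/n) = sum|e|/sum(a) = MADP.

```python
def accu(actual, forecast):
    """Accuracy summary mirroring the R `accu` helper (Python port for illustration)."""
    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]
    mean = sum(actual) / n
    sd = (sum((a - mean) ** 2 for a in actual) / (n - 1)) ** 0.5  # R-style sample sd
    mad = sum(abs(e) for e in errors) / n
    return {
        "n": n, "mean": mean, "sd": sd, "CV": sd / mean,
        "MAPE": sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "MAD": mad,
        "MADP": sum(abs(e) for e in errors) / sum(abs(a) for a in actual),
        "WMAPE": mad / mean,
    }
```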
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.1860551 66.12454 0.1686712 0.1686712
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.1814687 63.31901 0.1615148 0.1615148
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.1754918 61.90513 0.1579083 0.1579083
ARIMA(5,0,3) gives the best result on all accuracy measures. In terms of performance, the ARIMA model is better than the MA model, and the MA model is better than the AR model. ARIMA(5,0,3) is selected for the next steps.
In this step, possible regressors will be examined for use in an ARIMAX model.
The code below computes the correlation between sold count and the other data columns.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(Variable=character(8),Correlation=numeric(8)) # pre-allocate 8 rows
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
cor(xiaomi$sold_count,xiaomi$price)
## [1] -0.5127488
correlation(xiaomi)
## Variable Correlation
## 1 visit count 0.21153610
## 2 basket count 0.86567764
## 3 favored count 0.22808036
## 4 category sold 0.53232294
## 5 category visit 0.01179411
## 6 category basket 0.06057578
## 7 category favored 0.27347833
## 8 category brand sold 0.06973421
Obviously, there is high correlation between sold count and basket count, category sold, and price. This result makes sense: people tend to buy what they have basketed or favored, and higher prices make people buy less. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they must be predicted as well in the final version of the model.
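Selecting the strongest regressor amounts to ranking candidates by the absolute value of their Pearson correlation with sold count. A small Python sketch with made-up data and column names (hypothetical, for illustration only):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank_regressors(target, candidates):
    """Sort candidate regressors by |correlation| with the target, strongest first."""
    return sorted(((name, pearson(target, xs)) for name, xs in candidates.items()),
                  key=lambda kv: -abs(kv[1]))

sold = [3, 5, 4, 8, 7]
ranked = rank_regressors(sold, {
    "basket_count": [30, 52, 41, 79, 69],   # tracks sales closely
    "category_visit": [7, 3, 9, 2, 8],      # mostly unrelated
})
```

Note that the sign matters only for interpretation: price correlates negatively with sales, yet its magnitude can still make it a viable regressor.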
In this step, the potential regressors will be used in the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(5,0,3),xreg=xiaomi$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(5, 0, 3), xreg = xiaomi$basket_count)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.2035 -0.7361 -0.0094 -0.1607 -0.3037 -0.0914 0.4821 -0.3312
## s.e. 0.1565 0.1304 0.1136 0.0761 0.0642 0.1576 0.1160 0.0926
## intercept xiaomi$basket_count
## 0.8694 1e-04
## s.e. 0.0091 NaN
##
## sigma^2 estimated as 0.04347: log likelihood = 53.98, aic = -85.97
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(5,0,3),xreg=xiaomi$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(5, 0, 3), xreg = xiaomi$category_sold)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## -0.0373 0.2239 0.5257 -0.1746 0.2020 0.6214 0.0901 -0.5859
## s.e. 0.1237 0.0956 0.0976 0.0958 0.0579 0.1326 0.1572 0.1344
## intercept xiaomi$category_sold
## 0.4151 7e-04
## s.e. 0.0697 1e-04
##
## sigma^2 estimated as 0.03922: log likelihood = 72.69, aic = -123.39
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(5,0,3),xreg=xiaomi$price)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(5, 0, 3), xreg = xiaomi$price)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.2516 -0.6212 0.1537 -0.1925 -0.2367 -0.1888 0.3213 -0.5803
## s.e. 0.1470 0.1232 0.1084 0.0851 0.0686 0.1467 0.1095 0.1029
## intercept xiaomi$price
## 1.1520 -0.0012
## s.e. 0.0557 0.0004
##
## sigma^2 estimated as 0.04731: log likelihood = 38.31, aic = -54.61
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] -44.69223
AIC(modelx1)
## [1] -85.96773
AIC(modelx2)
## [1] -123.3877
AIC(modelx3)
## [1] -54.6138
For an initial comparison, all AIC values are negative and quite low, but the ARIMAX model with category sold as the regressor has the lowest AIC. However, the test results may differ.
Three ARIMAX models have been constructed, one for each candidate regressor. They will now be tested on the last week of data.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.003007114 1.334838 0.003404918 0.003404918
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.002735178 1.240497 0.003164274 0.003164274
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.003638369 1.522088 0.003882558 0.003882558
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 392.0323 218.106 0.5563472 0.002616265 1.19986 0.003060615 0.003060615
Interpreting the test, arimax3 (ARIMAX with price as the regressor) gives the lowest error measures. It is the best on all measures, such as WMAPE, even though its AIC value is relatively higher.
After all of the tests and operations above, the arimax3 model (price as the regressor) gives the best results, so it is selected as the final model.
Predicting new price values with the moving-averages method:
price1<-tail(xiaomi$price,7)
for (i in 1:26) {
  price1 <- append(price1, mean(price1[i:(i+6)]))
}
price1
## [1] 130.1068 129.5155 126.6857 126.5967 123.7674 122.3905 120.6812 125.6777
## [9] 125.0450 124.4063 124.0807 123.7213 123.7147 123.9038 124.3642 124.1766
## [17] 124.0525 124.0020 123.9907 124.0292 124.0741 124.0985 124.0605 124.0439
## [25] 124.0427 124.0485 124.0568 124.0607 124.0588 124.0531 124.0521 124.0532
## [33] 124.0548
Forecasted values are below:
price_forxreg<-c(tail(xiaomi$price,3),price1[8:33])
preds<-predict(modelx3,n.ahead = 29,newxreg = price_forxreg)$pred
preds<-ts(preds,frequency = 7)
last_trend_value <-tail(decomposed11$trend[!is.na(decomposed11$trend)],1)
seasonality=decomposed11$seasonal[6:34]
preds=preds*last_trend_value*seasonality
print(preds)
## Time Series:
## Start = c(1, 1)
## End = c(5, 1)
## Frequency = 7
## [1] 267.1686 253.2808 299.1230 284.1226 291.0920 323.8695 299.0740 265.0663
## [9] 253.6065 292.4543 297.7587 297.9049 313.2368 290.5201 266.0455 253.4606
## [17] 290.8184 301.8939 301.9038 311.1497 288.5967 266.8977 253.0083 289.3073
## [25] 302.8601 303.3456 310.5270 288.1521 267.4918
plot(preds,xlab="Weeks", main="Forecasts", ylab="Predictions")
In this step, the seasonality of each product is analysed to choose an appropriate decomposition.
The plot below shows the sold count of product-4 for the related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. That observation directs us to check the autocorrelation in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around the 7th and 14th lags, the autocorrelation values get higher, and the autocorrelation fades away at later lags. Product-4 should be given a frequency of 7, because a lag-7 seasonality also covers lag 14. So there is weekly seasonality.
As seen below, the variance increases over time, so the best way of decomposing is the multiplicative method.
Required codes are below:
ts11=ts(fakir$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is quite low and acceptable. Moreover, as seen in the graph, the random (detrended) values are fairly stationary, with stable variance (except for some jumps) and a nearly constant mean. The trend does not tell us anything important, but the model is still acceptable. Lastly, the seasonal component shows a decent pattern with an acceptable shape.
Required codes are below:
ts12=ts(fakir$sold_count, frequency = 14)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=14")
plot(decomposed12)
The autocorrelation seems higher than in the previous one. The reason might be that a frequency of 14 does not capture the weekly pattern directly, whereas choosing a frequency of 7 also covers lag 14; so frequency = 14 is worse than frequency = 7.
The detrended (random) values are less stationary in this case: their variance is not constant and their mean is less stable than before. The trend in this series is wavier, though it tends to decrease more. The seasonal component looks similar. Overall, this model is worse than the previous one because the detrended random values show more nonstationarity.
Required codes are below:
ts13=ts(fakir$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
The detrended values are worse in this model. They show nonstationary behavior due to changing variance and mean. The trend component gives no clear indication of increase or decrease, since it first decreases and then increases in the middle. The ACF of the detrended values is not very good, and the random component is worse than in the weekly model because of the changing variance and barely constant mean. The seasonal component has more terms because of the larger frequency.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As understood from the ACF and PACF plots, both tend to decay. For the decaying-ACF reading, the PACF does not show significant spikes after lag 3, so an AR(3) model can be proposed. For the decaying-PACF reading, the ACF does not show significant spikes after lag 3, so MA(3) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of a seasonal ARIMA here.
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.1085 -0.0945 -0.3236 0.9801
## s.e. 0.0494 0.0495 0.0500 0.0108
##
## sigma^2 estimated as 0.07311: log likelihood = -40.83, aic = 91.67
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show constant mean but variance increases.
MA(3) is created below:
model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## -0.0202 -0.2480 -0.3644 0.9800
## s.e. 0.0478 0.0487 0.0430 0.0051
##
## sigma^2 estimated as 0.07012: log likelihood = -33.33, aic = 76.65
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too.
Residuals are closer to 0 than in the first model, and the AIC value is better here.
Auto Arima is created below:
model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(4,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 mean
## 0.7425 -0.1878 -0.2388 0.0305 -0.7569 0.9799
## s.e. 0.0810 0.0639 0.0643 0.0638 0.0613 0.0051
##
## sigma^2 estimated as 0.06838: log likelihood=-25.8
## AIC=65.61 AICc=65.92 BIC=92.93
model13=arima(decomposed11$random,order=c(4,0,1)) # Manual re-fit of the auto.arima choice (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(4,0,1)")
Residuals show a constant mean, but the variance differs at times. The ACF and PACF plots tell us this is the best model in terms of lower remaining autocorrelation.
AIC(model11)
## [1] 91.6659
BIC(model11)
## [1] 111.1791
AIC(model12)
## [1] 76.65494
BIC(model12)
## [1] 96.1681
AIC(model13)
## [1] 65.60766
BIC(model13)
## [1] 92.9261
As seen above, model13, the auto.arima fit ARIMA(4,0,1), is the best on AIC. The second-best model is MA(3).
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.2396726 9.1253 0.2297071 0.2297071
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.2376557 8.985335 0.2261838 0.2261838
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.2304129 8.615689 0.2168789 0.2168789
ARIMA(4,0,1) gives the best result on all accuracy measures. In terms of performance, the ARIMA model is better than the MA model, and the MA model is better than the AR model. ARIMA(4,0,1) is selected for the next steps.
In this step, possible regressors will be examined for use in an ARIMAX model.
The code below computes the correlation between sold count and the other data columns.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(Variable=character(8),Correlation=numeric(8)) # pre-allocate 8 rows
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
cor(fakir$sold_count,fakir$price)
## [1] -0.32637
correlation(fakir)
## Variable Correlation
## 1 visit count -0.102187158
## 2 basket count 0.866866490
## 3 favored count -0.145043073
## 4 category sold 0.759162610
## 5 category visit 0.004137553
## 6 category basket -0.122458472
## 7 category favored 0.567747022
## 8 category brand sold -0.170524082
Obviously, there is high correlation between sold count and basket count, category sold, and category favored. This result makes sense: people tend to buy what they have favored or basketed. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they must be predicted as well in the final version of the model.
In this step, the potential regressors will be used in the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(4,0,1),xreg=fakir$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(4, 0, 1), xreg = fakir$basket_count)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 intercept fakir$basket_count
## 0.6858 -0.1792 -0.2419 0.0176 -0.6888 0.9641 1e-04
## s.e. 0.1246 0.0635 0.0628 0.0689 0.1272 0.0196 2e-04
##
## sigma^2 estimated as 0.06706: log likelihood = -25.18, aic = 66.37
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(4,0,1),xreg=fakir$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(4, 0, 1), xreg = fakir$category_sold)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 intercept fakir$category_sold
## 0.5079 -0.1532 -0.2452 0.0022 -0.4742 0.9095 4e-04
## s.e. 0.2094 0.0633 0.0596 0.0809 0.2270 0.0285 2e-04
##
## sigma^2 estimated as 0.06431: log likelihood = -17.39, aic = 50.78
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(4,0,1),xreg=fakir$category_favored)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(4, 0, 1), xreg = fakir$category_favored)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 intercept
## 0.7029 -0.1821 -0.2399 0.0218 -0.7198 0.9571
## s.e. 0.0803 0.0625 0.0632 0.0655 0.0555 0.0069
## fakir$category_favored
## 0
## s.e. NaN
##
## sigma^2 estimated as 0.06646: log likelihood = -23.58, aic = 63.17
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 65.60766
AIC(modelx1)
## [1] 66.36656
AIC(modelx2)
## [1] 50.77947
AIC(modelx3)
## [1] 63.16558
For an initial comparison, arimax2 (category sold as the regressor) has the lowest AIC value. This might be a sign for selection.
Three ARIMAX models have been constructed, one for each candidate regressor. They will now be tested on the last week of data.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP
## 1 372 39.72581 39.17959 0.9862504 0.005127835 0.05148335 0.001295967
## WMAPE
## 1 0.001295967
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.00489123 0.04979956 0.001253582 0.001253582
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP
## 1 372 39.72581 39.17959 0.9862504 0.004614496 0.04775374 0.001202083
## WMAPE
## 1 0.001202083
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP
## 1 372 39.72581 39.17959 0.9862504 0.005139033 0.05145274 0.001295197
## WMAPE
## 1 0.001295197
Interpreting the test, all four models give very close, high-quality results, so we could justifiably choose any of them. Even though arimax2 has the lowest AIC, the test results are close, and to reduce the risk of forecast mistakes (the regressors themselves would have to be forecast), the simplest model should be chosen: the plain ARIMA(4,0,1) model without regressors.
After all of the tests and operations above, the ARIMA(4,0,1) model is selected as the final model due to its simplicity.
Forecasted Values:
preds<-predict(model13,n.ahead = 29)$pred
preds<-ts(preds,frequency = 7)
last_trend_value <-tail(decomposed11$trend[!is.na(decomposed11$trend)],1)
seasonality=decomposed11$seasonal[6:34]
preds=preds*last_trend_value*seasonality
print(preds)
## Time Series:
## Start = c(1, 1)
## End = c(5, 1)
## Frequency = 7
## [1] 12.12999 13.23874 11.88360 14.11928 14.35010 13.86052 11.62507 11.65004
## [9] 13.56411 12.52490 14.70478 14.53641 13.73103 11.42786 11.50201 13.51428
## [17] 12.56911 14.79148 14.59744 13.74544 11.41304 11.47929 13.49608 12.56550
## [25] 14.79848 14.60718 13.75141 11.41390 11.47733
plot(preds,xlab="Weeks", main="Forecasts", ylab="Predictions")
In this step, the seasonality of each product is analysed to choose an appropriate decomposition.
The plot below shows the sold count of product-5 for the related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. That observation directs us to check the autocorrelation in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around the 14th and 15th lags, the autocorrelation values get higher, and the autocorrelation fades away at later lags. A frequency of 7 will also be used in the decomposition part, because it has given the best results so far for the other products. Finally, it would not be wrong to say that a similar pattern repeats every 14 days.
As seen below, the variance increases over time, so the best way of decomposing is the multiplicative method.
Required codes are below:
ts11=ts(tayt$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series is quite low and acceptable. Moreover, as seen in the graph, the random (detrended) values can be counted as stationary, with stable variance and an approximately constant mean. The trend is wavy and does not tell us anything important, but the model might still be acceptable. Lastly, the seasonal component shows a decent pattern with an acceptable shape.
Required codes are below:
ts13=ts(tayt$sold_count, frequency = 14)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=14")
plot(decomposed13)
The autocorrelation seems slightly higher than in the previous one.
The detrended (random) values are less stationary in this case: their variance is less constant and their mean is less stable than before. The trend in this series is not linear, which hurts forecasting. The seasonal component looks similar. Overall, this model is worse than the previous one because the detrended random values show signs of nonstationarity.
Required codes are below:
ts12=ts(tayt$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
plot(decomposed12)
The detrended values are worse in this model. They show nonstationary behavior due to changing variance and mean. The trend component is again not linear; it fluctuates. The ACF of the detrended values is not bad. Nevertheless, the random component is worse than in the weekly model because of the changing variance and barely constant mean. The seasonal component has more terms because of the larger frequency.
For the further parts, even though none of the options give ideal results, we will continue with the model of frequency = 7 because its detrended values are slightly better (slightly more stationary, with a more constant mean and variance).
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As understood from the ACF and PACF plots, both tend to decay. For the decaying-ACF reading, the PACF does not show significant spikes after lag 2, so an AR(2) model can be proposed (the autocorrelation is negative at lag 2, another sign for AR(2)). For the decaying-PACF reading, the ACF does not show significant spikes after lag 5, so MA(5) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of a seasonal ARIMA here.
AR(2) is created below:
model11=arima(decomposed11$random,order=c(2,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 0.5551 -0.3861 0.9639
## s.e. 0.0488 0.0492 0.0182
##
## sigma^2 estimated as 0.08323: log likelihood = -64.61, aic = 137.22
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show a constant mean, but the variance changes a lot; the ACF and PACF still show some autocorrelation.
MA(5) is created below:
model12=arima(decomposed11$random,order=c(0,0,5))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 5))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 intercept
## 0.4285 -0.1421 -0.3891 -0.2371 -0.0751 0.9629
## s.e. 0.0522 0.0578 0.0520 0.0532 0.0526 0.0086
##
## sigma^2 estimated as 0.07724: log likelihood = -51.05, aic = 116.09
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too.
Residuals look the same as in the AR(2) model, but from the summaries above the AIC value is better (lower) here, and the ACF and PACF plots show that autocorrelation is handled better in this model.
The auto.arima model is created below:
model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(2,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 mean
## 1.0850 -0.6042 -0.6853 0.9633
## s.e. 0.0534 0.0426 0.0515 0.0087
##
## sigma^2 estimated as 0.07533: log likelihood=-44.56
## AIC=99.11 AICc=99.28 BIC=118.63
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual refit of the auto.arima choice (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals are still similar to the previous two models. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values.
AIC(model11)
## [1] 137.224
BIC(model11)
## [1] 152.8346
AIC(model12)
## [1] 116.0926
BIC(model12)
## [1] 143.411
AIC(model13)
## [1] 99.11418
BIC(model13)
## [1] 118.6273
As seen above, model13, the auto.arima choice ARIMA(2,0,1), has the lowest AIC and BIC values. The second-best model is MA(5).
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
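As a quick sanity check of accu, a tiny hand-computable example (the definition is repeated here so the snippet is self-contained):

```r
# accu repeated so this snippet stands alone
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
}
# Every absolute error is 10, so MAD = 10 and WMAPE = 10/200 = 0.05;
# for all-positive actuals, WMAPE and MADP coincide
res=accu(c(100,200,300),c(110,190,310))
print(res)
```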
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.2689123 233.0372 0.2715396 0.2715396
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.2695938 222.8493 0.2596685 0.2596685
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.2641632 217.3164 0.2532214 0.2532214
All error measures are small and close to one another, but if we have to choose one, ARIMA(2,0,1) gives slightly better results on every test parameter. By performance, the ARIMA model beats the MA model, and the MA model beats the AR model. ARIMA(2,0,1) should be selected for the further steps.
In this step, possible regressors will be examined for use in an ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Correlation between sold counts (column 4) and the eight related columns
dt=as.data.frame(dt)
a=data.frame(Variable=c("visit count","basket count","favored count","category sold",
"category visit","category basket","category favored","category brand sold"),
Correlation=NA_real_)
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return(a)
}
cor(tayt$sold_count,tayt$price)
## [1] -0.2583435
correlation(tayt)
## Variable Correlation
## 1 visit count 0.05290733
## 2 basket count 0.83720969
## 3 favored count 0.16874292
## 4 category sold 0.90004658
## 5 category visit 0.05368907
## 6 category basket -0.08685947
## 7 category favored 0.57715502
## 8 category brand sold -0.03864685
Clearly there is high correlation between sold count and basket count, category sold, and category favored. This makes sense because people tend to buy what they have favored or added to their basket. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown and must themselves be predicted in the final version of the model.
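The correlation function above only measures contemporaneous correlation; base R's ccf() can additionally show whether a candidate regressor leads or lags sales, which matters when its future values have to be predicted. A sketch on simulated data (variable names are illustrative, not from the dataset):

```r
set.seed(21)
n=200
full=as.numeric(arima.sim(model=list(ar=0.7),n=n+2))
# Sales react two days after "basket" additions, plus noise
sales=0.8*head(full,n)+rnorm(n,sd=0.3)
basket=tail(full,n)
# ccf(x,y) at lag k estimates cor(x[t+k], y[t])
cc=ccf(basket,sales,lag.max=5,plot=FALSE)
best_lag=cc$lag[which.max(abs(cc$acf))]
print(best_lag) # the regressor leads sales by 2 days, so the peak is at lag -2
```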
In this step, the potential regressors will be used in the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=tayt$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = tayt$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept tayt$basket_count
## 0.9270 -0.4999 -0.4183 0.8575 0
## s.e. 0.0778 0.0390 0.0688 0.0199 NaN
##
## sigma^2 estimated as 0.07117: log likelihood = -35.99, aic = 83.99
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=tayt$category_sold)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = tayt$category_sold)
##
## Coefficients:
## ar1 ar2 ma1 intercept tayt$category_sold
## 0.6093 -0.0576 0.2404 0.6511 2e-04
## s.e. 0.1538 0.1151 0.1450 0.0398 NaN
##
## sigma^2 estimated as 0.05661: log likelihood = 5.79, aic = 0.42
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=tayt$category_favored)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = tayt$category_favored)
##
## Coefficients:
## ar1 ar2 ma1 intercept tayt$category_favored
## 1.0271 -0.5787 -0.6007 0.8578 0
## s.e. 0.0526 0.0405 0.0372 0.0211 NaN
##
## sigma^2 estimated as 0.07161: log likelihood = -37.22, aic = 86.44
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 99.11418
AIC(modelx1)
## [1] 83.98627
AIC(modelx2)
## [1] 0.4222833
AIC(modelx3)
## [1] 86.43665
For an initial comparison, model arimax2 (with category sold as regressor) has by far the lowest AIC value. If the test results for the last week do not change the picture, arimax2 seems to be the best option.
Three models were constructed, one for each candidate regressor. They will now be tested on the last week.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.001727326 0.4430522 0.0005162533 0.0005162533
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.0019466 0.5126242 0.00059732 0.00059732
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.002237589 0.5641885 0.0006574038 0.0006574038
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 858.207 1099.162 1.280766 0.001601462 0.4117527 0.0004797825 0.0004797825
All four models give really close, high-performance results. Model arimax3 (with category favored as regressor) actually gives the lowest WMAPE, MADP, and MAD values, but the differences are negligible, so the choice can be made on AIC. Arimax2 (with category sold as regressor) has a remarkably low AIC value, close to 0 (the others were around 80). So we improve the model to an ARIMAX with category sold as regressor.
After all the tests and operations above, the arimax2 model (with category sold as regressor) is selected as the final model due to its outstandingly low AIC value.
Predicting new category sold values with the moving-averages method:
category_sold1<-tail(tayt$category_sold,7)
for (i in 1:26) {
category_sold1<-append(category_sold1,(category_sold1[i]+category_sold1[i+1]+category_sold1[i+2]+category_sold1[i+3]+category_sold1[i+4]+category_sold1[i+5]+category_sold1[i+6])/7)
}
category_rounded<-round(category_sold1)
category_rounded
## [1] 902 808 1046 955 7447 7757 8155 3867 4291 4788 5323 5947 5733 5443 5056
## [16] 5226 5359 5441 5458 5388 5339 5324 5362 5382 5385 5377 5365 5362 5365 5371
## [31] 5372 5371 5369
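The loop above implements a recursive moving average: each new forecast is the mean of the previous seven values, observed or already forecast. The same idea as a small reusable helper (ma_forecast is our own illustrative name, not from the report's code):

```r
# Recursive moving-average forecaster: each step appends the mean of the
# last `window` values, whether observed or already forecast
ma_forecast=function(history,h,window=7){
x=tail(history,window)
for(i in seq_len(h)){
x=c(x,mean(tail(x,window)))
}
tail(x,h)
}
print(ma_forecast(rep(5,20),h=3)) # a constant series forecasts itself: 5 5 5
print(ma_forecast(1:7,h=1))       # mean of 1..7 = 4
```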
Forecasted values are below:
category_sold_forxreg<-c(tail(tayt$category_sold,3),category_rounded[8:33])
preds<-predict(modelx2,n.ahead = 29,newxreg = category_sold_forxreg)$pred
preds<-ts(preds,frequency = 7)
last_trend_value <-tail(decomposed11$trend[!is.na(decomposed11$trend)],1)
seasonality=decomposed11$seasonal[6:34]
preds=preds*last_trend_value*seasonality
print(preds)
## Time Series:
## Start = c(1, 1)
## End = c(5, 1)
## Frequency = 7
## [1] 475.2683 500.9323 560.5512 441.8567 481.0352 504.9829 416.2135 410.1603
## [9] 412.1301 432.5117 510.5128 536.8137 538.6822 421.6004 389.4661 397.1311
## [17] 427.6158 526.0474 544.9461 540.0419 419.0315 386.0401 396.1315 428.6978
## [25] 528.4242 545.4843 539.4505 418.3894 385.7017
plot(preds,xlab="Weeks", main="Forecasts", ylab="Predictions")
In this step, seasonality of each product is analysed to make appropriate decomposition.
The plot below shows the sold count of product-6 for related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others. That directs us to check the autocorrelation in order to determine the related seasonality.
Table below shows autocorrelation values at each lag until lag 100.
In this case, autocorrelation is high at every lag. Frequencies of 7, 14, and 30 will be examined in the decomposition part and the seasonality selected accordingly. As will appear in the next part, frequency = 7 gives the most stationary case and the lowest ACF values after decomposition, so it would not be wrong to say there is approximately weekly seasonality.
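The weekly pattern can also be confirmed numerically: for a series with period-7 seasonality, the sample ACF peaks at lags 7, 14, and 21 and is much weaker at non-multiples of 7. A sketch on synthetic data (not the product series):

```r
set.seed(3)
# Two simulated years of daily sales: a fixed weekly profile plus noise
weekly_profile=c(10,12,15,14,13,22,28)
x=rep(weekly_profile,104)+rnorm(7*104,sd=2)
a=acf(x,lag.max=21,plot=FALSE)$acf[-1] # drop lag 0
print(round(a[c(3,7,10,14,21)],2))     # high at 7, 14, 21; low elsewhere
```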
As seen below, the variance increases over time, so the multiplicative method is the best way to decompose.
Required codes are below:
ts11=ts(firca$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF of the decomposed series looks really low and acceptable except for a few lags. As seen in the graph, the random (detrended) values are not quite stationary, with decreasing variance and a changing mean. The trend tends to increase. The model might still be selected because the other options are even worse. Lastly, the seasonal component shows a decent pattern with an acceptable shape.
Required codes are below:
ts13=ts(firca$sold_count, frequency = 14)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=14")
plot(decomposed13)
Autocorrelation seems slightly higher than in the previous one but still better than before decomposition.
Detrended (random) values are even less stationary in this case. They have jumps in variance, which also decreases, although their mean is more constant than in the previous one. The trend in this series is inclined to go up. The seasonal component looks similar. However, this model is worse than the previous one because the detrended random values show more signs of nonstationarity than the frequency = 7 case.
Required codes are below:
ts12=ts(firca$sold_count, frequency = 30)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=30")
plot(decomposed12)
Detrended values are the worst in this model. They show clear nonstationary behavior due to changing variance and mean. The trend component is again inclined to increase. The ACF of the detrended values is not too bad. Nevertheless, the random values are worse than the weekly one because of changing variance and an inconstant mean. The seasonal component has more terms because of the larger frequency.
For the further parts, even though none of the three is quite good, frequency = 7 is acceptable and the best one, so we will continue with the model of frequency = 7 due to its slightly better detrended values compared to the other options (slightly more stationary, with more constant mean and variance).
Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.
As understood from the ACF and PACF plots, both tend to decay. Since the PACF shows no significant spikes after the first few lags, an AR(3) model is proposed; since the ACF shows no significant spikes after lag 5, an MA(5) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, there is no sign of seasonal ARIMA here.
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.3530 -0.1543 -0.1970 0.9439
## s.e. 0.0512 0.0538 0.0511 0.0212
##
## sigma^2 estimated as 0.1637: log likelihood = -188.37, aic = 386.75
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.
Residuals show a constant mean, but the variance changes at some points. The ACF and PACF still show some autocorrelation.
MA(5) is created below:
model12=arima(decomposed11$random,order=c(0,0,5))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 5))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 intercept
## 0.3168 -0.1039 -0.2983 -0.1865 -0.0843 0.9434
## s.e. 0.0525 0.0522 0.0518 0.0482 0.0520 0.0135
##
## sigma^2 estimated as 0.1589: log likelihood = -182.91, aic = 379.82
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are really close to the real sales amounts (black) in this model too.
Residuals look similar to the AR(3) model, with slightly better variance. But from the summaries above, the AIC value is better (lower) in this model, and the ACF and PACF plots show that autocorrelation is handled better.
The auto.arima model is created below:
model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(2,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 mean
## 1.0468 -0.4621 -0.7305 0.9436
## s.e. 0.0645 0.0462 0.0580 0.0136
##
## sigma^2 estimated as 0.1598: log likelihood=-182
## AIC=374.01 AICc=374.17 BIC=393.52
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual refit of the auto.arima choice (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again really close to real sales amount (black).
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals are still similar to the previous two models. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values. Sometimes auto.arima cannot find the best model, but in this case it worked very well.
AIC(model11)
## [1] 386.7499
BIC(model11)
## [1] 406.2631
AIC(model12)
## [1] 379.822
BIC(model12)
## [1] 407.1404
AIC(model13)
## [1] 374.0077
BIC(model13)
## [1] 393.5209
As seen above, model13, the auto.arima choice ARIMA(2,0,1), has the lowest AIC and BIC values. The second-best model is MA(5).
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
accu(ts11,tail(head(fitted_transformed11,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 Inf 20.02766 0.217197 0.217197
accu(ts11,tail(head(fitted_transformed12,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 Inf 19.9664 0.2165326 0.2165326
accu(ts11,tail(head(fitted_transformed13,369),366))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 Inf 19.68207 0.2134491 0.2134491
The tests show all three models are close to each other, but the ARIMA(2,0,1) model is slightly better, with lower error terms. (MAPE is Inf here because the series contains zero-sales days, so MAD, MADP, and WMAPE are the meaningful measures.) So we continue with ARIMA(2,0,1).
In this step, possible regressors will be examined for use in an ARIMAX model.
Code below shows correlation between sold count and the data categories.
correlation=function(dt){ # Correlation between sold counts (column 4) and the eight related columns
dt=as.data.frame(dt)
a=data.frame(Variable=c("visit count","basket count","favored count","category sold",
"category visit","category basket","category favored","category brand sold"),
Correlation=NA_real_)
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return(a)
}
cor(firca$sold_count,firca$price)
## [1] NA
correlation(firca)
## Variable Correlation
## 1 visit count 0.8409331
## 2 basket count 0.9515283
## 3 favored count 0.7896215
## 4 category sold 0.3998297
## 5 category visit 0.1229495
## 6 category basket 0.5771166
## 7 category favored 0.4229052
## 8 category brand sold 0.4911847
Clearly there is high correlation between sold count and visit count, basket count, and favored count. The price data is dirty for this product, so the best option is not to use price as a regressor. However, a potential problem with these regressors is that their future values are unknown and must themselves be predicted in the final version of the model.
In this step, the potential regressors will be used in the model and ARIMAX models will be created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=firca$visit_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = firca$visit_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept firca$visit_count
## 1.0622 -0.4665 -0.7616 0.9207 0
## s.e. 0.0279 0.0344 NaN NaN NaN
##
## sigma^2 estimated as 0.1545: log likelihood = -177.89, aic = 367.79
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=firca$basket_count)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = firca$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept firca$basket_count
## 1.0356 -0.4542 -0.7256 0.8750 2e-04
## s.e. 0.0703 0.0485 0.0756 0.0232 1e-04
##
## sigma^2 estimated as 0.1478: log likelihood = -169.67, aic = 351.34
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=firca$favored_count)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = firca$favored_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept firca$favored_count
## 1.0578 -0.4650 -0.7599 0.9097 1e-04
## s.e. 0.0615 0.0464 0.0542 0.0158 1e-04
##
## sigma^2 estimated as 0.1531: log likelihood = -176.22, aic = 364.44
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 374.0077
AIC(modelx1)
## [1] 367.7873
AIC(modelx2)
## [1] 351.3435
AIC(modelx3)
## [1] 364.4441
For an initial comparison, model arimax2 (with basket count as regressor, the most correlated variable at 0.95) has the lowest AIC value. If the test results for the last week do not change the picture, arimax2 seems to be the best option.
Three models were constructed, one for each candidate regressor. They will now be tested on the last week.
accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004081113 0.725805 0.007871245 0.007871245
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004215994 0.6889001 0.007471018 0.007471018
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004182911 0.6607526 0.007165762 0.007165762
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004102785 0.6787281 0.007360704 0.007360704
Interpreting the test, model arimax2 (with basket count as regressor) gives the lowest WMAPE, MADP, and MAD values. So the model is enhanced with the new regressor.
Predicting new basket count values with moving averages method:
basket1<-tail(firca$basket_count,7)
for (i in 1:26) {
basket1<-append(basket1,(basket1[i]+basket1[i+1]+basket1[i+2]+basket1[i+3]+basket1[i+4]+basket1[i+5]+basket1[i+6])/7)
}
basket_rounded<-round(basket1)
basket_rounded
## [1] 507 937 911 739 615 604 608 703 731 702 672 662 669 678 688 686 679 676 677
## [20] 679 680 681 680 679 679 679 680 680 680 679 679 679 679
Forecasted values are below:
basket_count_forxreg<-c(tail(firca$basket_count,3),basket_rounded[8:33])
preds<-predict(modelx2,n.ahead = 29,newxreg = basket_count_forxreg)$pred
preds<-ts(preds,frequency = 7)
last_trend_value <-tail(decomposed11$trend[!is.na(decomposed11$trend)],1)
seasonality=decomposed11$seasonal[6:34]
preds=preds*last_trend_value*seasonality
print(preds)
## Time Series:
## Start = c(1, 1)
## End = c(5, 1)
## Frequency = 7
## [1] 135.8001 157.4911 169.6182 192.4264 188.8385 169.1245 155.4458 143.0871
## [9] 161.0363 170.2922 189.3741 185.4870 167.7103 155.5687 143.7662 161.6624
## [17] 170.5639 189.2016 185.2489 167.6467 155.6059 143.7984 161.6908 170.5736
## [25] 189.1795 185.2228 167.6484 155.6050 143.7969
plot(preds,xlab="Weeks", main="Forecasts", ylab="Predictions")
After all the tests and operations above, the arimax2 model (with basket count as regressor) is selected as the final model.
Necessary starting operations are done again.
library(data.table)
library(ggplot2)
library(dplyr)
require(mgcv)
library(fpp)
require(gratia)
library(readxl)
library(lubridate)
library(forecast)
library(zoo)
library(stringr)
library(stats)
library(urca)
g13<-read_excel("g13.xlsx")
dat<-as.data.table(g13)
mont<-dat[2605:2976,]
head(mont)
## event_date product_content_id price sold_count visit_count basket_count
## 1: 2020-05-25 48740784 NA 0 0 0
## 2: 2020-05-26 48740784 NA 0 0 0
## 3: 2020-05-27 48740784 NA 0 0 0
## 4: 2020-05-28 48740784 NA 0 0 0
## 5: 2020-05-29 48740784 NA 0 0 0
## 6: 2020-05-30 48740784 NA 0 0 0
## favored_count category_sold category_visits category_basket category_favored
## 1: 0 15 1002 0 6866
## 2: 0 14 1091 0 7201
## 3: 0 16 1036 0 6331
## 4: 0 17 846 0 5708
## 5: 0 21 932 0 5482
## 6: 0 15 778 0 5275
## category_brand_sold ty_visits Type
## 1: 0 1 Mont
## 2: 0 1 Mont
## 3: 0 1 Mont
## 4: 0 1 Mont
## 5: 0 1 Mont
## 6: 0 1 Mont
tail(mont)
## event_date product_content_id price sold_count visit_count basket_count
## 1: 2021-05-26 48740784 299.990 4 93 6
## 2: 2021-05-27 48740784 299.990 1 104 7
## 3: 2021-05-28 48740784 299.990 1 93 5
## 4: 2021-05-29 48740784 299.990 3 212 13
## 5: 2021-05-30 48740784 299.990 3 281 17
## 6: 2021-05-31 48740784 449.985 2 221 15
## favored_count category_sold category_visits category_basket category_favored
## 1: 5 19 786 199872 5881
## 2: 4 15 935 218188 6942
## 3: 8 12 1805 263575 10811
## 4: 13 2598 357502 16389 22636
## 5: 16 1360 288949 9436 18708
## 6: 22 1684 329087 12614 24534
## category_brand_sold ty_visits Type
## 1: 12184 106195988 Mont
## 2: 13930 107391579 Mont
## 3: 17912 103514886 Mont
## 4: 15 129670029 Mont
## 5: 16 131821083 Mont
## 6: 12 125439876 Mont
mont[is.na(price)==T,price:=0] ## filling the missing price values
mont
## event_date product_content_id price sold_count visit_count basket_count
## 1: 2020-05-25 48740784 0.000 0 0 0
## 2: 2020-05-26 48740784 0.000 0 0 0
## 3: 2020-05-27 48740784 0.000 0 0 0
## 4: 2020-05-28 48740784 0.000 0 0 0
## 5: 2020-05-29 48740784 0.000 0 0 0
## ---
## 368: 2021-05-27 48740784 299.990 1 104 7
## 369: 2021-05-28 48740784 299.990 1 93 5
## 370: 2021-05-29 48740784 299.990 3 212 13
## 371: 2021-05-30 48740784 299.990 3 281 17
## 372: 2021-05-31 48740784 449.985 2 221 15
## favored_count category_sold category_visits category_basket
## 1: 0 15 1002 0
## 2: 0 14 1091 0
## 3: 0 16 1036 0
## 4: 0 17 846 0
## 5: 0 21 932 0
## ---
## 368: 4 15 935 218188
## 369: 8 12 1805 263575
## 370: 13 2598 357502 16389
## 371: 16 1360 288949 9436
## 372: 22 1684 329087 12614
## category_favored category_brand_sold ty_visits Type
## 1: 6866 0 1 Mont
## 2: 7201 0 1 Mont
## 3: 6331 0 1 Mont
## 4: 5708 0 1 Mont
## 5: 5482 0 1 Mont
## ---
## 368: 6942 13930 107391579 Mont
## 369: 10811 17912 103514886 Mont
## 370: 22636 15 129670029 Mont
## 371: 18708 16 131821083 Mont
## 372: 24534 12 125439876 Mont
mont_sold_ts<-ts(mont$sold_count)
ts.plot(mont_sold_ts)
acf(mont_sold_ts)
We do not have sufficient sales data to build an ARIMA model, so we will use linear regression models instead.
Firstly, we will prepare the data table with our attributes for the lm model.
mont_table<-data.table(sales=as.numeric(mont_sold_ts))
str(mont_table)
## Classes 'data.table' and 'data.frame': 372 obs. of 1 variable:
## $ sales: num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
daynumber<-seq(1,7,by=1)
mont_table<-cbind(mont_table,daynumber)
## Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names
## = check.names, : Item 2 has 7 rows but longest item has 372; recycled with
## remainder.
mont_table[,prices:=mont$price]
mont_table[,basket_count:=mont$basket_count]
mont_table[,trend:=1:.N]
mont_table
## sales daynumber prices basket_count trend
## 1: 0 1 0.000 0 1
## 2: 0 2 0.000 0 2
## 3: 0 3 0.000 0 3
## 4: 0 4 0.000 0 4
## 5: 0 5 0.000 0 5
## ---
## 368: 1 4 299.990 7 368
## 369: 1 5 299.990 5 369
## 370: 3 6 299.990 13 370
## 371: 3 7 299.990 17 371
## 372: 2 1 449.985 15 372
Now we’re going to try different lm models with different regressors.
fit1<-lm(sales~trend+prices+basket_count+as.factor(daynumber),data = mont_table)
summary(fit1)
##
## Call:
## lm(formula = sales ~ trend + prices + basket_count + as.factor(daynumber),
## data = mont_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1332 -0.1590 0.0461 0.2495 10.2033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0472991 0.2509875 -0.188 0.851
## trend 0.0000672 0.0007415 0.091 0.928
## prices 0.0005427 0.0003565 1.523 0.129
## basket_count 0.1670093 0.0050155 33.299 <2e-16 ***
## as.factor(daynumber)2 -0.1753545 0.2963395 -0.592 0.554
## as.factor(daynumber)3 0.2022230 0.2971146 0.681 0.497
## as.factor(daynumber)4 0.1334307 0.2965766 0.450 0.653
## as.factor(daynumber)5 -0.1574852 0.2968562 -0.531 0.596
## as.factor(daynumber)6 -0.2203497 0.2962633 -0.744 0.458
## as.factor(daynumber)7 -0.4448828 0.2962676 -1.502 0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.532 on 362 degrees of freedom
## Multiple R-squared: 0.8168, Adjusted R-squared: 0.8122
## F-statistic: 179.3 on 9 and 362 DF, p-value: < 2.2e-16
checkresiduals(fit1)
##
## Breusch-Godfrey test for serial correlation of order up to 13
##
## data: Residuals
## LM test = 105.22, df = 13, p-value < 2.2e-16
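checkresiduals() reports a Breusch-Godfrey LM test; its statistic is n·R² from an auxiliary regression of the residuals on the original regressors plus lagged residuals. A base-R sketch on simulated data (illustrative, not our sales data):

```r
set.seed(1)
n=200
x=rnorm(n)
y=1+2*x+as.numeric(arima.sim(model=list(ar=0.5),n=n)) # serially correlated errors
fit=lm(y~x)
e=residuals(fit)
# Auxiliary regression: residuals on the regressor plus p lagged residuals
p=5
lags=sapply(1:p,function(k) c(rep(0,k),head(e,n-k)))
aux=lm(e~x+lags)
LM=n*summary(aux)$r.squared # ~ chi-squared(p) under "no serial correlation"
pval=pchisq(LM,df=p,lower.tail=FALSE)
print(c(LM=LM,p_value=pval))
```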
fit2<-lm(sales~ prices+basket_count,data = mont_table)
summary(fit2)
##
## Call:
## lm(formula = sales ~ prices + basket_count, data = mont_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5477 0.1302 0.1323 0.1323 10.3880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1322921 0.0873130 -1.515 0.131
## prices 0.0005495 0.0003541 1.552 0.122
## basket_count 0.1674479 0.0049798 33.626 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.531 on 369 degrees of freedom
## Multiple R-squared: 0.8134, Adjusted R-squared: 0.8124
## F-statistic: 804.1 on 2 and 369 DF, p-value: < 2.2e-16
checkresiduals(fit2)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 83.381, df = 10, p-value = 1.088e-13
fit3<-lm(sales~0+prices+basket_count,data = mont_table)
summary(fit3)
##
## Call:
## lm(formula = sales ~ 0 + prices + basket_count, data = mont_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.49 0.00 0.00 0.00 10.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## prices 0.0003764 0.0003358 1.121 0.263
## basket_count 0.1669916 0.0049794 33.537 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.534 on 370 degrees of freedom
## Multiple R-squared: 0.8221, Adjusted R-squared: 0.8211
## F-statistic: 854.7 on 2 and 370 DF, p-value: < 2.2e-16
checkresiduals(fit3)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 83.691, df = 10, p-value = 9.455e-14
The models we build do not have satisfactory ACF plots; however, their mean residuals are close to 0. We will go on with the fit3 model since it has the highest R-squared value (note that R-squared is computed without centering for a no-intercept model, so this comparison is only indicative). We will add the fitted values and residuals to our data table.
mont_table[,fitted1:=fitted(fit3)]
mont_table[,residuals1:=residuals(fit3)]
mont_table
## sales daynumber prices basket_count trend fitted1 residuals1
## 1: 0 1 0.000 0 1 4.996004e-15 -4.996004e-15
## 2: 0 2 0.000 0 2 -6.026343e-14 6.026343e-14
## 3: 0 3 0.000 0 3 0.000000e+00 0.000000e+00
## 4: 0 4 0.000 0 4 0.000000e+00 0.000000e+00
## 5: 0 5 0.000 0 5 0.000000e+00 0.000000e+00
## ---
## 368: 1 4 299.990 7 368 1.281849e+00 -2.818488e-01
## 369: 1 5 299.990 5 369 9.478656e-01 5.213444e-02
## 370: 3 6 299.990 13 370 2.283799e+00 7.162013e-01
## 371: 3 7 299.990 17 371 2.951765e+00 4.823478e-02
## 372: 2 1 449.985 15 372 2.674236e+00 -6.742356e-01
We will visualize our fitted values and actual values.
mont_table %>%
ggplot(aes(x=fitted1, y=residuals1)) +
geom_point()
mont_table %>%
ggplot(aes(x=fitted1, y=sales)) +
geom_point() +
geom_abline(slope=1, intercept=0)
Now we will add the price and basket count values for the days we will predict. We used the moving-averages method, based on the last 7 observations, to predict prices and basket counts. We also added trend and day numbers to our data table.
basket1<-c(6,6,7,5,13,17,15)
for (i in 1:26) {
basket1<-append(basket1,(basket1[i]+basket1[i+1]+basket1[i+2]+basket1[i+3]+basket1[i+4]+basket1[i+5]+basket1[i+6])/7)
}
basket_rounded<-round(basket1)
basket_rounded
## [1] 6 6 7 5 13 17 15 10 10 11 12 13 12 12 11 12 12 12 12 12 12 12 12 12 12
## [26] 12 12 12 12 12 12 12 12
price1<-c(299.99,299.99,299.99,299.99,299.99,299.99,499.99)
for (j in 1:26) {
price1<-append(price1,(price1[j]+price1[j+1]+price1[j+2]+price1[j+3]+price1[j+4]+price1[j+5]+price1[j+6])/7)
}
price1
## [1] 299.9900 299.9900 299.9900 299.9900 299.9900 299.9900 499.9900 328.5614
## [9] 332.6431 337.3078 342.6389 348.7316 355.6947 363.6525 344.1757 346.4063
## [17] 348.3725 349.9532 350.9981 351.3218 350.6972 348.8464 349.5136 349.9575
## [25] 350.1840 350.2169 350.1054 349.9316 349.8222 349.9616 350.0256 350.0353
## [33] 350.0141
a<-seq(2,7)
b<-rep(seq(1,7),2)
c<-seq(1,6)
d<-c(a,b,c)
day_info<-d
mont_table=rbind(mont_table,data.table(daynumber=as.factor(day_info)),fill=T)
mont_table[,trend:=1:.N]
mont_table[is.na(prices)==T,prices:=price1[8:33]]
mont_table[is.na(basket_count)==T,basket_count:=basket_rounded[8:33]]
mont_table
## sales daynumber prices basket_count trend fitted1 residuals1
## 1: 0 1 0.0000 0 1 4.996004e-15 -4.996004e-15
## 2: 0 2 0.0000 0 2 -6.026343e-14 6.026343e-14
## 3: 0 3 0.0000 0 3 0.000000e+00 0.000000e+00
## 4: 0 4 0.0000 0 4 0.000000e+00 0.000000e+00
## 5: 0 5 0.0000 0 5 0.000000e+00 0.000000e+00
## ---
## 394: NA 2 349.8222 12 394 NA NA
## 395: NA 3 349.9616 12 395 NA NA
## 396: NA 4 350.0256 12 396 NA NA
## 397: NA 5 350.0353 12 397 NA NA
## 398: NA 6 350.0141 12 398 NA NA
Now, we are ready to predict sales.
predict(fit3,mont_table[is.na(fitted1)==T])
## 1 2 3 4 5 6 7 8
## 1.793577 1.795113 1.963861 2.132859 2.302144 2.137773 2.140768 1.966446
## 9 10 11 12 13 14 15 16
## 2.134277 2.135017 2.135612 2.136005 2.136127 2.135892 2.135195 2.135446
## 17 18 19 20 21 22 23 24
## 2.135613 2.135699 2.135711 2.135669 2.135604 2.135562 2.135615 2.135639
## 25 26
## 2.135643 2.135635
mont_table[is.na(fitted1)==T,fitted1:=predict(fit3,mont_table[is.na(fitted1)==T])]
print(tail(mont_table$fitted1,26))
## [1] 1.793577 1.795113 1.963861 2.132859 2.302144 2.137773 2.140768 1.966446
## [9] 2.134277 2.135017 2.135612 2.136005 2.136127 2.135892 2.135195 2.135446
## [17] 2.135613 2.135699 2.135711 2.135669 2.135604 2.135562 2.135615 2.135639
## [25] 2.135643 2.135635
ts.plot(tail(mont_table$fitted1,26), main="Forecast for Sales")
In this step, the seasonality of each product is analyzed to choose an appropriate decomposition.
The plot below shows the sold count of product-8 for the related period.
As seen above, there is a pattern that fluctuates up on certain days and down on others, and the variance does not look constant. This directs us to check the autocorrelation in order to determine the relevant seasonality.
The table below shows the autocorrelation values at each lag up to lag 100.
The autocorrelation function also shows a pattern: the values rise around lags 10 and 44 and fade away at later lags. On that basis, the time series of product-8 sold counts would normally be given a frequency of 10, which would also cover the other autocorrelations. However, as the next step shows, a frequency of 7 yields a better decomposition than 10: it covers the remaining autocorrelated lags better and gives more stationary residuals, so it is reasonable to speak of weekly seasonality even though the autocorrelation function does not show it directly.
As seen below, the variance first decreases and then increases, so a multiplicative decomposition is the best choice.
The required code is below:
ts11=ts(bik_1$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF values of the decomposed series stay below the significance levels. As the graph shows, the random (detrended) component is quite stationary, with stable variance and a nearly constant mean. The trend increases with fluctuations in the later part of the series, which may introduce some forecast error, but the model is still acceptable. Lastly, the seasonal component shows a clear, reasonable pattern.
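The multiplicative decomposition used here satisfies sold_count = trend × seasonal × random wherever the trend is defined, which is why dividing out the seasonal and trend components as above isolates the random part. A quick sketch of that identity on synthetic data (the series below is illustrative, not the Trendyol data):

```r
set.seed(1)
# synthetic positive weekly-seasonal series standing in for sold_count
x   <- ts(50 + 10 * sin(2 * pi * (1:98) / 7) + rnorm(98, sd = 2), frequency = 7)
dec <- decompose(x, type = "multiplicative")
recomposed <- dec$trend * dec$seasonal * dec$random
ok <- !is.na(recomposed)                  # trend and random are NA at the edges
all.equal(as.numeric(x)[ok], as.numeric(recomposed)[ok])   # TRUE
```

The same identity is what lets the fitted ARIMA values be transformed back to the sales scale later by multiplying with the seasonal and trend components.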
The required code is below:
ts12=ts(bik_1$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
The higher ACF values at frequency 15 might reflect that in Turkey most people are paid on either the 1st or the 15th of the month. (Source) People probably tend to shop more after receiving their salary; even when they generally pay by credit card, fresh income gives them confidence to spend more. For this product, however, frequency 15 does not give the values we want: as noted above, the ACF of the detrended series is now worse, and the plot of detrended values below is worse than the one at frequency 7.
The detrended (random) values are less stationary in this case: their variance is not constant and their mean is less stable than before. The trend component fluctuates less but moves up and down over time. The seasonal component looks similar. Overall this model is worse than the previous one because the detrended random values show more nonstationarity.
The required code is below:
ts13=ts(bik_1$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")
The detrended values are the worst in this model: they behave nonstationarily, with changing variance and a nonzero mean. The trend component increases more than in the other two decompositions and fluctuates less, but the first two ACF values of the detrended series are high, which is not what we want. Its random component is also the worst of the three models, and the seasonal component has a larger range, which affects the results more.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, we propose ARIMA models and comment on their performance.
As the ACF and PACF plots show, both have high correlations in some parts. Treating the ACF as decaying, the PACF shows no significant spikes after lag 3, so an AR(3) model can be proposed; treating the PACF as decaying, the ACF shows no significant spikes after lag 2, so MA(2) is proposed. We can also run auto.arima and compare the three candidates to select the best model; the auto.arima result will probably be the best one to build on. It should also be mentioned that there is no sign of a seasonal ARIMA here.
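The three candidates can be compared in a single pass before committing to one. A minimal sketch on a synthetic stationary series (the product data itself is not reproduced here, so the AIC numbers are illustrative):

```r
set.seed(7)
# synthetic stationary series standing in for the decomposed random component
y <- arima.sim(model = list(ar = c(0.5, -0.3)), n = 120)

candidates <- list(`AR(3)` = c(3, 0, 0), `MA(2)` = c(0, 0, 2), `ARMA(2,1)` = c(2, 0, 1))
aics <- sapply(candidates, function(o) AIC(arima(y, order = o)))
aics
names(which.min(aics))   # the order carried forward to the next steps
```

In the report, `auto.arima` from the forecast package performs this search over the (p,d,q) space automatically.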
AR(3) is created below:
model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.2455 -0.3144 -0.1884 0.9900
## s.e. 0.1012 0.0982 0.1003 0.0232
##
## sigma^2 estimated as 0.07927: log likelihood = -14.62, aic = 39.24
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are close to the real sales amounts (black). However, this is a simple model and it does not fit a large share of the data points; the fit is slightly lagged because of the nature of AR models.
Residuals show a constant mean, but the variance varies at times. The ACF and PACF plots still show some autocorrelation spikes above the significance levels.
MA(2) is created below:
model12=arima(decomposed11$random,order=c(0,0,2))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 2))
##
## Coefficients:
## ma1 ma2 intercept
## -0.2064 -0.6247 0.9907
## s.e. 0.0739 0.0703 0.0055
##
## sigma^2 estimated as 0.08212: log likelihood = -16.74, aic = 41.49
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are very close to the real sales amounts (black) in this model too; the fit is lagged due to the nature of MA models.
Residuals show a constant mean, but in this model the variance is higher at first and lower in the later parts. Even though the ACF and PACF values are better here, they are still high.
auto.arima is run below:
model13=auto.arima(decomposed11$random,seasonal=FALSE)
print(model13)
## Series: decomposed11$random
## ARIMA(2,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 mean
## 0.9430 -0.4970 -0.9058 0.9906
## s.e. 0.0924 0.0895 0.0542 0.0051
##
## sigma^2 estimated as 0.06996: log likelihood=-7.22
## AIC=24.44 AICc=25.11 BIC=37.21
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual refit of the auto.arima choice (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again very close to the real sales amounts (black). The model fits better where the variance is low; outlier points lower the quality of the fit.
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals show a constant mean; the variance still varies at times, but less than in the two previous models. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values.
AIC(model11)
## [1] 39.24273
BIC(model11)
## [1] 52.01212
AIC(model12)
## [1] 41.48683
BIC(model12)
## [1] 51.70234
AIC(model13)
## [1] 24.44036
BIC(model13)
## [1] 37.20975
As seen above, model13, the auto.arima choice ARIMA(2,0,1), is the best model with the lowest AIC and BIC values, so ARIMA(2,0,1) is selected for the further steps.
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
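As a sanity check on these error measures, consider actuals (10, 20, 30) and forecasts (12, 18, 33): MAD = (2 + 2 + 3)/3 ≈ 2.33 and WMAPE = MAD/mean(actual) ≈ 0.117. Note that WMAPE and MADP coincide whenever all actuals are non-negative, which is why the two columns match in every table below. The block condenses `accu` so the example is self-contained; the numbers are illustrative only.

```r
# condensed version of the accu function above, error measures only
accu <- function(actual, forecast) {
  n <- length(actual)
  error <- actual - forecast
  MAD <- sum(abs(error)) / n
  data.frame(MAPE  = sum(abs(error / actual)) / n,
             MAD   = MAD,
             MADP  = sum(abs(error)) / sum(abs(actual)),
             WMAPE = MAD / mean(actual))
}

res <- accu(c(10, 20, 30), c(12, 18, 33))
res$MAD     # (2 + 2 + 3) / 3 = 2.333...
res$WMAPE   # 2.333... / 20 = 0.11666..., equal to MADP here
```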
accu(ts11,tail(head(fitted_transformed11,98),95))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.2446731 6.699741 0.2042481 0.2042481
accu(ts11,tail(head(fitted_transformed12,98),95))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.2386745 6.370985 0.1942256 0.1942256
accu(ts11,tail(head(fitted_transformed13,98),95))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.2253659 6.077611 0.1852818 0.1852818
Again the third result, ARIMA(2,0,1), gives the best performance, with the lowest errors on the MAPE, MAD, MADP and WMAPE measures. The remaining statistics (n, mean, sd, CV) are equal across the three models because they depend only on the same actual data.
In this step, possible regressors are examined for use in an arimax model.
The code below shows the correlation between sold count and the other data columns.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(Variable=rep(NA_character_,8),Correlation=rep(NA_real_,8))
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
correlation(bik_1)
## Variable Correlation
## 1 visit count 0.7099228
## 2 basket count 0.7960412
## 3 favored count 0.4806033
## 4 category sold 0.6833365
## 5 category visit 0.2349762
## 6 category basket 0.4955038
## 7 category favored 0.4896690
## 8 category brand sold 0.5047960
cor(bik_1$sold_count,bik_1$price)
## [1] -0.05910234
There is clearly high correlation between sold count and basket count, visit count and category sold. This makes sense: people tend to buy what they visit more, and when a category draws attention, the sales of all its products increase as well. Thanks to the function above, candidate regressors are selected easily. The potential problem with these regressors is that their future values are unknown, so they must be predicted as well in the final version of the model.
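The hand-built loop above can also be written more compactly with `sapply`. A sketch assuming the same column layout as the function above (sold count in column 4, the eight candidate regressors in columns 5 to 12); the frame below is synthetic, so the correlations are illustrative:

```r
set.seed(2)
# synthetic frame with the assumed layout: sold count in column 4,
# eight candidate regressors in columns 5-12
dt <- as.data.frame(matrix(rnorm(100 * 12), ncol = 12))

vars <- c("visit count", "basket count", "favored count", "category sold",
          "category visit", "category basket", "category favored",
          "category brand sold")
cors <- sapply(5:12, function(j) cor(dt[, 4], dt[, j]))
data.frame(Variable = vars, Correlation = cors)
```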
In this step, potential regressors will be used in the model and arimax models will be created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_1$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_1$basket_count
## 0.771 -0.4691 -0.6327 0.7984 0.001
## s.e. NaN 0.0375 NaN NaN NaN
##
## sigma^2 estimated as 0.06242: log likelihood = -3.35, aic = 18.69
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$visit_count)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_1$visit_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_1$visit_count
## 0.8744 -0.505 -0.8680 0.9139 0
## s.e. 0.0942 0.089 0.0636 0.0005 NaN
##
## sigma^2 estimated as 0.06188: log likelihood = -3.36, aic = 18.71
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$category_sold)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_1$category_sold)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_1$category_sold
## 0.9439 -0.4990 -1.0000 0.9718 0
## s.e. 0.0880 0.0875 0.0332 0.0001 NaN
##
## sigma^2 estimated as 0.06025: log likelihood = -3.59, aic = 19.18
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
Three models are constructed, one per regressor. They will now be tested on the last week.
accu(ts11,tail(head(fitted_transformed13,98),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.007614923 0.305363 0.009309287 0.009309287
accu(ts11,tail(head(fitted_transformedx1,98),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.007010133 0.2776748 0.008465184 0.008465184
accu(ts11,tail(head(fitted_transformedx2,98),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.007210265 0.277234 0.008451745 0.008451745
accu(ts11,tail(head(fitted_transformedx3,98),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.00780154 0.2981519 0.009089447 0.009089447
As seen in the results, all models perform similarly, but arimax1 is slightly better in terms of WMAPE and the other measures. The model with basket count as the regressor (arimax1) also outperforms the plain arima model from the previous step.
After all of the tests and operations above, the arimax1 model gives the best results and is selected as the final model.
Preparing the basket count values with the moving-average method:
basket1<-tail(bik_1$basket_count,7)
for (i in 1:26) {
basket1<-append(basket1,(basket1[i]+basket1[i+1]+basket1[i+2]+basket1[i+3]+basket1[i+4]+basket1[i+5]+basket1[i+6])/7)
}
basket_rounded<-round(basket1)
basket_rounded
## [1] 186 192 225 222 263 339 316 249 258 267 273 281 283 275 270 273 275 276 276
## [20] 275 274 274 275 275 275 275 275 275 275 275 275 275 275
Forecasting results are below:
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$basket_count)
basket_count_forxreg<-c(tail(bik_1$basket_count,3),basket_rounded[8:33])
preds<-predict(modelx1,n.ahead = 29,newxreg = basket_count_forxreg)$pred
preds<-ts(preds,frequency = 7)
last_trend_value <-tail(decomposed11$trend[!is.na(decomposed11$trend)],1)
seasonality=decomposed11$seasonal[1:28]
preds=preds*last_trend_value*seasonality
## Warning in `*.default`(preds * last_trend_value, seasonality): longer object
## length is not a multiple of shorter object length
print(preds)
## Time Series:
## Start = c(1, 1)
## End = c(5, 1)
## Frequency = 7
## [1] 41.94351 53.31056 47.89224 49.88406 52.03646 49.38292 48.98725 42.32421
## [9] 51.75573 47.04757 51.12193 52.48690 49.43094 48.99795 42.17668 51.48428
## [17] 47.05343 51.30102 52.55151 49.41184 48.95240 42.14537 51.49130 47.09735
## [25] 51.34526 52.54912 49.41105 48.95284 42.14598
ts.plot(preds,xlab = "Weeks", ylab="Predictions", main="Forecasts")
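The recycling warning above appears because 29 predictions are multiplied by only 28 seasonal factors; `rep_len` repeats the weekly pattern to the exact forecast length and removes it. A sketch with synthetic numbers (the real `preds` and seasonal factors come from the model above):

```r
set.seed(4)
preds <- ts(rnorm(29, mean = 2), frequency = 7)   # stand-in for predict()$pred
weekly_seasonal  <- runif(7, 0.8, 1.2)            # one cycle of seasonal factors
last_trend_value <- 25

seasonality <- rep_len(weekly_seasonal, length(preds))   # 29 factors, aligned
preds_backtransformed <- preds * last_trend_value * seasonality   # no warning
length(preds_backtransformed)   # 29
```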
In this step, the seasonality of each product is analyzed to choose an appropriate decomposition.
The plot below shows the sold count of product-9 for the related period.
As seen above, there is a pattern that fluctuates up and down, and the variance seems to increase over time. This directs us to check the autocorrelation in order to determine the relevant seasonality.
The table below shows the autocorrelation values at each lag up to lag 100.
The slowly decaying early lags reflect an increasing trend in the data. Since much of this series was removed while cleaning the missing values, only the lag-1 and lag-2 ACF values are significant. We cannot speak with certainty given the dirty nature of the data, but it would not be incorrect to say that the autocorrelation rises a little at lag 7, which suggests weekly seasonality.
As seen below, the variance first decreases and then increases, so a multiplicative decomposition is the best choice.
The required code is below:
ts11=ts(bik_2$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")
The ACF values of the decomposed series stay below the significance levels. As the graph shows, the random (detrended) component is quite stationary, with stable variance and a nearly constant mean. The trend increases with fluctuations, which may introduce some forecast error, but the model is still acceptable. Lastly, the seasonal component shows a clear, reasonable pattern.
The required code is below:
ts12=ts(bik_2$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")
The higher ACF values at frequency 15 might reflect that in Turkey most people are paid on either the 1st or the 15th of the month. (Source) This shows up as a significant autocorrelation at lag 1.0, which corresponds to the 15th day of the time series. People probably tend to shop more after receiving their salary; even when they generally pay by credit card, fresh income gives them confidence to spend more. However, as noted above, the ACF of the detrended series is now worse, and the plot of detrended values below is worse than the one at frequency 7.
The detrended (random) values are less stationary in this case: they do not have constant variance, although their mean is roughly constant compared to the previous one. The trend component fluctuates less but moves up and down over time. The seasonal component looks similar. Overall this model is worse than the previous one because the detrended random values show more nonstationarity.
For the further parts, we will continue with the model of frequency = 7.
Based on the findings in the first part, we propose ARIMA models and comment on their performance.
There is no clear sign of which model to use, so AR(2), MA(1) and ARIMA(2,0,1) will be examined.
AR(2) is created below:
model11=arima(decomposed11$random,order=c(2,0,0))
print(model11)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 0.3203 -0.3775 0.9892
## s.e. 0.0948 0.0940 0.0280
##
## sigma^2 estimated as 0.08229: log likelihood = -16.35, aic = 40.7
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")
Fitted values (blue) are close to the real sales amounts (black). However, this is a simple model on relatively few data points and it does not fit a large share of them; the fit is slightly lagged because of the nature of AR models.
Residuals show a constant mean, but the variance varies at times. The ACF and PACF plots still show autocorrelation spikes above the significance level at lag 7.
MA(1) is created below:
model12=arima(decomposed11$random,order=c(0,0,1))
print(model12)
##
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 1))
##
## Coefficients:
## ma1 intercept
## 0.3650 0.9880
## s.e. 0.0936 0.0423
##
## sigma^2 estimated as 0.09173: log likelihood = -21.4, aic = 48.79
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")
Fitted values (blue) are very close to the real sales amounts (black) in this model too; the fit is lagged due to the nature of MA models.
Residuals show a fluctuating mean and the variance is not very good in this model; however, the ACF and PACF values are better here.
ARIMA(2,0,1) is created below:
model13=arima(decomposed11$random,order =c(2,0,1))
print(model13)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1))
##
## Coefficients:
## ar1 ar2 ma1 intercept
## 0.9430 -0.4970 -0.9058 0.9906
## s.e. 0.0924 0.0895 0.0542 0.0051
##
## sigma^2 estimated as 0.06702: log likelihood = -7.22, aic = 24.44
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")
Fitted values (blue) are again very close to the real sales amounts (black). The model fits better where the variance is low; outlier points lower the quality of the fit.
plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")
tsdisplay(fitted13,main="ARIMA(2,0,1)")
Residuals show a constant mean, but the variance is high and fluctuating.
AIC(model11)
## [1] 40.69692
BIC(model11)
## [1] 50.91243
AIC(model12)
## [1] 48.79104
BIC(model12)
## [1] 56.45267
AIC(model13)
## [1] 24.44036
BIC(model13)
## [1] 37.20975
As seen above, model13, the ARIMA(2,0,1) model, is the best with the lowest AIC and BIC values, so ARIMA(2,0,1) is selected for the further steps.
Test for fitted values:
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
MAPE=sum(abs(error/actual))/n
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
return(l)
}
accu(ts11,tail(head(fitted_transformed11,31),28))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.1186466 2.884616 0.08794029 0.08794029
accu(ts11,tail(head(fitted_transformed12,31),28))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.1247375 2.98147 0.09089297 0.09089297
accu(ts11,tail(head(fitted_transformed13,31),28))
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.1132647 2.620198 0.07987925 0.07987925
Again the third result, ARIMA(2,0,1), gives the best performance, with the lowest errors on the MAPE, MAD, MADP and WMAPE measures.
In this step, possible regressors are examined for use in an arimax model.
The code below shows the correlation between sold count and the other data columns.
correlation=function(dt){ # Function to compute correlation between sold counts and other columns in the data
dt=as.data.frame(dt)
a=data.frame(Variable=rep(NA_character_,8),Correlation=rep(NA_real_,8))
a[1,1]="visit count"
a[2,1]="basket count"
a[3,1]="favored count"
a[4,1]="category sold"
a[5,1]="category visit"
a[6,1]="category basket"
a[7,1]="category favored"
a[8,1]="category brand sold"
for(i in 1:8) {
a[i,2]=cor(dt[,4],dt[,i+4])
}
return (a)
}
correlation(bik_2)
## Variable Correlation
## 1 visit count 0.7099228
## 2 basket count 0.7960412
## 3 favored count 0.4806033
## 4 category sold 0.6833365
## 5 category visit 0.2349762
## 6 category basket 0.4955038
## 7 category favored 0.4896690
## 8 category brand sold 0.5047960
cor(bik_2$sold_count,bik_2$price)
## [1] -0.05910234
There is clearly high correlation between sold count and basket count, visit count and favored count. This makes sense: people tend to buy what they visit more, and when a category draws attention, the sales of all its products increase as well. Thanks to the function above, candidate regressors are selected easily. The potential problem with these regressors is that their future values are unknown, so they must be predicted as well in the final version of the model.
In this step, potential regressors will be used in the model and arimax models will be created.
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$basket_count)
print(modelx1)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_2$basket_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_2$basket_count
## 0.771 -0.4691 -0.6327 0.7984 0.001
## s.e. NaN 0.0375 NaN NaN NaN
##
## sigma^2 estimated as 0.06242: log likelihood = -3.35, aic = 18.69
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend
modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$visit_count)
print(modelx2)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_2$visit_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_2$visit_count
## 0.8744 -0.505 -0.8680 0.9139 0
## s.e. 0.0942 0.089 0.0636 0.0005 NaN
##
## sigma^2 estimated as 0.06188: log likelihood = -3.36, aic = 18.71
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend
modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$favored_count)
print(modelx3)
##
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_2$favored_count)
##
## Coefficients:
## ar1 ar2 ma1 intercept bik_2$favored_count
## 0.8885 -0.5054 -0.8556 0.9579 1e-04
## s.e. 0.0922 0.0886 0.0557 NaN NaN
##
## sigma^2 estimated as 0.06526: log likelihood = -5.83, aic = 23.65
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
Three models are constructed, one per regressor. They will now be tested on the last week.
accu(ts11,tail(head(fitted_transformed13,31),7)) # arima model
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.04803259 0.6914002 0.021078 0.021078
accu(ts11,tail(head(fitted_transformedx1,31),7)) # arimax1
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.03311664 0.5477929 0.01669999 0.01669999
accu(ts11,tail(head(fitted_transformedx2,31),7)) # arimax2
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.04203058 0.6142789 0.01872689 0.01872689
accu(ts11,tail(head(fitted_transformedx3,31),7)) # arimax3
## n mean sd CV MAPE MAD MADP WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.04493964 0.6470274 0.01972526 0.01972526
AIC(model13)
## [1] 24.44036
AIC(modelx1)
## [1] 18.69395
AIC(modelx2)
## [1] 18.71035
AIC(modelx3)
## [1] 23.65499
As seen in the results, all models perform similarly, but arimax1 is slightly better in terms of WMAPE and the other measures. The model with basket count as the regressor (arimax1) also outperforms the plain arima model from the previous step.
After all of the tests and operations above, the arimax1 model gives the best results and is selected as the final model.
Preparing the basket count values with the moving-average method:
basket1<-tail(bik_2$basket_count,7)
for (i in 1:26) {
basket1<-append(basket1,(basket1[i]+basket1[i+1]+basket1[i+2]+basket1[i+3]+basket1[i+4]+basket1[i+5]+basket1[i+6])/7)
}
basket_rounded<-round(basket1)
basket_rounded
## [1] 186 192 225 222 263 339 316 249 258 267 273 281 283 275 270 273 275 276 276
## [20] 275 274 274 275 275 275 275 275 275 275 275 275 275 275
Forecasting results are below:
modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$basket_count)
basket_count_forxreg<-c(tail(bik_2$basket_count,3),basket_rounded[8:33])
preds<-predict(modelx1,n.ahead = 29,newxreg = basket_count_forxreg)$pred
preds<-ts(preds,frequency = 7)
last_trend_value <-tail(decomposed11$trend[!is.na(decomposed11$trend)],1)
seasonality=decomposed11$seasonal[1:28]
preds=preds*last_trend_value*seasonality/2
## Warning in `*.default`(preds * last_trend_value, seasonality): longer object
## length is not a multiple of shorter object length
print(preds)
## Time Series:
## Start = c(1, 1)
## End = c(5, 1)
## Frequency = 7
## [1] 20.97175 26.65528 23.94612 24.94203 26.01823 24.69146 24.49363 21.16210
## [9] 25.87787 23.52378 25.56096 26.24345 24.71547 24.49898 21.08834 25.74214
## [17] 23.52672 25.65051 26.27575 24.70592 24.47620 21.07269 25.74565 23.54868
## [25] 25.67263 26.27456 24.70552 24.47642 21.07299
ts.plot(preds,xlab = "Weeks", ylab="Predictions", main="Forecasts")
During the project the steps above were carried out, and forecast values were obtained for the relevant period. However, the models for bikini 1 and bikini 2 were ineffective due to limited data; the other predictions are acceptable and more realistic, each with a better model. As a further extension, it would be valuable to combine these model results with qualitative methods: after presenting the model, expert opinion should be sought for new analyses. Additionally, reporting confidence intervals for these predictions would convey extra information, which is a great advantage for further analyses. In short, combining these results with confidence intervals and qualitative input is the best way to forecast.
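The confidence intervals suggested above come almost for free from `predict`, which returns forecast standard errors alongside the point forecasts. A sketch of approximate 95% intervals on a synthetic series (the real call would use `modelx1` with `newxreg`, as in the forecasting step, and the interval endpoints would be back-transformed with the same trend and seasonal factors as the point forecasts):

```r
set.seed(3)
# synthetic stationary series standing in for the decomposed random component
y   <- arima.sim(model = list(ar = c(0.5, -0.3)), n = 120)
fit <- arima(y, order = c(2, 0, 1))

fc    <- predict(fit, n.ahead = 14)     # $pred and $se
lower <- fc$pred - 1.96 * fc$se         # approximate 95% lower bound
upper <- fc$pred + 1.96 * fc$se         # approximate 95% upper bound
cbind(lower, forecast = fc$pred, upper)
```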